Summary of the recent failures on the production cluster – February 22nd, 2013

At 3:36pm PST we had a large number of failures across our production cluster. We quickly found that a single machine (MC4) was responsible. We run a distributed cluster that scales extremely well but also ties many machines together, such that a single failing machine can cause unresponsiveness throughout much of the cluster. We are actively working on strategies to manage this coupling better.

By 3:40pm MC4 had been largely isolated from the rest of the cluster, bringing all other machines back to full performance. We then migrated all customers of MC4 over the next few minutes, completing this operation by 3:44pm PST.

The cause of the issue with MC4 is not yet fully known. Its response time to ICMP (ping) packets went from the expected 0.3ms range to 500ms, which is already at the limit of the memcache timeout. We had changed nothing on that machine prior to the incident, and load was normal, so at this point we suspect an issue with Amazon’s network or underlying hardware. We will continue to investigate and let you know as we learn more.
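
To illustrate the kind of check that could flag this sort of latency degradation, here is a minimal Python sketch. It is purely hypothetical and does not describe our actual tooling: the `LatencyMonitor` class, the 100ms threshold, and the five-sample window are assumptions chosen for illustration. Only the 0.3ms and 500ms figures come from the incident described above.

```python
from collections import deque
from statistics import median


class LatencyMonitor:
    """Tracks recent round-trip times for one node (hypothetical helper)."""

    def __init__(self, threshold_ms=100.0, window=5):
        # threshold_ms and window are illustrative assumptions, not real settings.
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)

    def record(self, rtt_ms):
        """Store one round-trip time in milliseconds; old samples fall off."""
        self.samples.append(rtt_ms)

    def is_degraded(self):
        """Return True once a full window of samples has a median above the threshold."""
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough data to judge yet
        return median(self.samples) > self.threshold_ms


# Healthy pings around 0.3ms, followed by a jump to 500ms.
monitor = LatencyMonitor(threshold_ms=100.0, window=5)
for rtt in [0.3, 0.3, 0.4, 500.0, 500.0, 500.0]:
    monitor.record(rtt)

print(monitor.is_degraded())
```

Using the median over a short window, rather than a single sample, means one slow ping would not be enough to flag a machine, while a sustained jump like the one seen on MC4 would be.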

We apologize to all affected customers.
