At 3:36pm PST we saw a large number of failures across our production cluster. We quickly found that a single machine (MC4) was responsible. We run a distributed cluster that scales extremely well but also ties many machines together, such that a single failing machine can cause unresponsiveness throughout much of the cluster. We are actively working on strategies to manage this coupling better.
By 3:40pm PST, MC4 had been largely isolated from the rest of the cluster, restoring all other machines to full performance. We then migrated all customers off MC4 over the next few minutes, completing this operation by 3:44pm PST.
The root cause of the issue with MC4 is not yet fully known. Its responsiveness to ICMP (ping) packets went from the expected 0.3ms range to roughly 500ms, already at the limit of the memcache timeout. We changed nothing on that machine prior to the incident and its load was normal, so at this point we suspect an issue with Amazon's network or underlying hardware. We will continue to investigate and update you as we learn more.
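As a rough illustration of the kind of check this incident motivates, here is a minimal sketch of flagging nodes whose ping latency approaches the memcache client timeout so they can be isolated early. The node names, the 500ms threshold, and the function itself are hypothetical, not our actual tooling; the latency measurements are assumed to come from an external probe.

```python
MEMCACHE_TIMEOUT_MS = 500   # assumed client-side memcache timeout
HEALTHY_RTT_MS = 0.3        # expected in-cluster ping latency

def nodes_to_isolate(rtts_ms, timeout_ms=MEMCACHE_TIMEOUT_MS, margin=0.5):
    """Return nodes whose round-trip latency exceeds margin * timeout_ms.

    rtts_ms: dict mapping node name -> measured ping RTT in milliseconds.
    A node at half the memcache timeout is already dangerously slow,
    so the default margin flags it well before requests start failing.
    """
    limit = timeout_ms * margin
    return [node for node, rtt in sorted(rtts_ms.items()) if rtt >= limit]

# Example: one node at ~500ms while the rest sit near 0.3ms.
measurements = {"MC1": 0.3, "MC2": 0.4, "MC3": 0.3, "MC4": 500.0}
print(nodes_to_isolate(measurements))  # -> ['MC4']
```

In practice a check like this would feed an automated isolation step rather than a manual one, shrinking the window between a machine degrading and the rest of the cluster recovering.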
We apologize to all affected customers.