US-East C1 Outage Postmortem: September 30th

Over the last couple days the US-East-C1 cluster experienced some performance issues, the worst of which occurred on September 30th between 8:36am and 8:44am (PST), when latencies were bad enough that several customers apps were severely affected. We’re very sorry to all affected customers and want to explain what happened and what we’re doing to prevent such incidents in the future.

The Unreliability of SMS

Have you ever sent a friend an SMS and it took them longer then seems reasonable to respond? That’s basically what happened to us on September 30th.

For the most part, MemCachier is engineered to recover gracefully when one of the nodes in a cluster is performing very poorly. However, in some cases one of the the support engineer on call needs to get involved–for example, when the performance is worse than normal within what seems like regular network jitter, or when performance issues in one part of the cluster lead to a thundering hurd of clients to another part of the cluster.

In these rare cases, the engineer’s ability to response promptly is obviously very important.

The incident on September 30th was one of these rare cases. Two backend machines in the cluster starting seeing very high latency, which affected the several proxy servers which mediate between clients and those those machines. Our monitoring systems picked up on the issue and alerted the engineer on call.

Unfortunately, the alert was over SMS and the engineer didn’t receive it for another 10 minutes. This meant the incident lasted long enough and escalated in impact before the engineer could resolve, that customers were negatively affected.

Future Steps

We’re in the progress of moving away from SMS for alerting our support engineers. We are switching to using XMPP (aka Jabber, and the same protocol used by messaging apps like WhatsApp). This will provide better reliability, faster delivery, and most importantly, allow our monitoring systems to (sooner) know if the person on-call was reached or the incident needs to be escalated.

Once again, we’re very sorry to all affected customers.