Trouble in US-East - Postmortem
Around 3pm PST on Monday, we started seeing very high connection counts to two proxy servers in one of the clusters in US-East. Normally, this correlates with either a poor network connection that causes lots of connection timeouts and re-connections from clients, or a handful of clients making an abnormal number of connections. In those cases we can simply replace an affected machine or reroute traffic to balance the load out.
Unfortunately, we were not able to attribute the drastic rise in connection count to either poor performance of a machine or heavy load from any particular subset of clients. We did see abnormal network behavior, with most latency falling in the usual ~0.3ms range between machines, but around 10-20% of requests were in the 5-15ms range. This, however, was across all machines.
Sporadically, the high number of connections moved from one proxy to another in the cluster. We replaced several of the proxy machines and managed to restore stable operation to that cluster within about 30 minutes.
However, shortly afterwards, two different clusters began exhibiting similar behaviors. These clusters are isolated: they don’t share any resources and are in different availability zones. Similarly, there were no obvious signs of issues other than some minor increases in network latency. Network bandwidth, CPU load, etc. were all normal, and we hadn’t deployed new software to these servers.
We eventually replaced nearly all of the proxy machines and some of the backend machines in US-East. We also increased the total number of proxies in each cluster.
In all, the incidents lasted until around 11pm PST on Wednesday morning. While service for most customers returned to normal within an hour, many of you were affected sporadically or continuously throughout the entire incident.
What caused this?
We believe it was two issues. First, the network started behaving abnormally, with latency for a small subset of packets going from 0.3ms to 5-15ms and dropping a portion of packets. This was occurring across the board, not just to a few machines. Second, a number of the ways in which our system responds to failure actually amplified this particularly rare issue. For example, we log connection errors between servers. During a widespread outage like the one that occurred, this means millions of log entries. Worse, we log the struct that identifies a failed server, which includes a list of outstanding requests to that server. Similarly, we continuously record the connection count for each server using the lsof command, which isn’t particularly efficient. Typically, this isn’t a problem, but when connection counts are unusually high, it doesn’t help an already overloaded CPU trying frantically to free unused file descriptors.
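As an added example (not the provider's code), here is one way to stop the error-logging path from amplifying a cluster-wide failure: aggregate identical connection errors per backend into one counted line per window, and log a fixed-size identifier rather than the server struct with its outstanding-request list. The class and function names are constructed for this illustration.

```python
"""Added illustration (not the provider's code): aggregate connection errors into one
summary line per window and log a fixed-size server id, not the struct with its
outstanding-request list. ProxyBackend/AggregatingErrorLog are constructed names."""
import logging
import time
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class ProxyBackend:
    host: str
    port: int
    outstanding: list = field(default_factory=list)  # pending requests to this server

    def ident(self) -> str:
        # Fixed-size identifier: this is what belongs in a log line, not repr(self).
        return f"{self.host}:{self.port}"

class AggregatingErrorLog:
    """Counts (backend, error kind) pairs and emits one line per pair per window."""
    def __init__(self, logger, window_s=10.0, clock=time.monotonic):
        self.logger, self.window_s, self.clock = logger, window_s, clock
        self.counts = Counter()
        self.window_start = clock()

    def connection_error(self, backend, err):
        self.counts[(backend.ident(), type(err).__name__)] += 1
        if self.clock() - self.window_start >= self.window_s:
            self.flush()

    def flush(self):
        """Emit and return the summary lines for the current window, then reset."""
        lines = [f"conn_error backend={ident} kind={kind} count={n}"
                 for (ident, kind), n in self.counts.items()]
        for line in lines:
            self.logger.warning(line)
        self.counts.clear()
        self.window_start = self.clock()
        return lines

def compare_log_volume(n_errors=100_000, n_outstanding=5_000):
    """Bytes a naive per-error struct dump would write vs. the aggregated summary."""
    backend = ProxyBackend("10.0.0.7", 11211, [f"req-{i}" for i in range(n_outstanding)])
    naive_bytes = n_errors * len(repr(backend))  # one struct dump (with request list) per error
    agg = AggregatingErrorLog(logging.getLogger("proxy"))
    for _ in range(n_errors):
        agg.connection_error(backend, ConnectionRefusedError())
    lines = agg.flush()
    return naive_bytes, sum(len(l) for l in lines), lines
```

The naive volume scales with both the number of errors and the size of the outstanding-request list, which is exactly the combination that grows during an outage; the aggregated form stays at one short line per backend per window.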
We also made human errors that hurt a subset of clients. In addition to launching new servers and migrating load, we added IP filtering rules to help mitigate the load temporarily. This helped, but meant that new connections from some legitimate clients were blocked.
What’s high availability?
Many of you reached out to support for clarification on what kind of availability our high-availability plans provide. High-availability today means you have access to more than one proxy in the same cluster. Should one fail, the other is available. Importantly, each proxy has the same view of the underlying cache, so a transient network failure or an individual server crash can be masked by your client library without losing access to any portion of your cache. This works well in the vast majority of cases, where individual servers fail or experience high network latency. However, it doesn’t help in cases like this one, where entire clusters were affected.
Moving forward we’re investigating if and how to offer plans that span multiple clusters. If you have suggestions, please email them to email@example.com.
First, we are completely revising how we do incident response. We are establishing codified procedures to step through for all incident types now, including cluster wide failures like the one this week.
Second, we are fixing issues in our systems that amplified the underlying issue and reviewing other technical ways to prevent issues propagating through the cluster.
Third, we’ve normally tried to skew towards mitigating cache loss and resolving network issues as quickly as possible by re-balancing load and replacing individual machines in a cluster. While we will still do this for most incidents, we are putting in place a much more aggressive course of action to take when issues actually start affecting customers. In particular, starting today, we will skew towards resolving connectivity issues as fast as possible at the expense of flushing parts of the cache.
We’ve already deployed changes to internal tools that make this process quicker and more automated, and we’re working on additional changes to our monitoring infrastructure that will help identify cases like this one that require more drastic measures.
We are also improving our tooling around communication. This was the worst outage we’ve ever had, and resulted in a breakdown of how we communicate with customers. We’re going to integrate our status page, Twitter account and monitoring systems so that we can inform our customers of the ongoing issues much more quickly and seamlessly.
We are sorry for the outage and the impact it had on our customers. This is the worst one we’ve had in almost 5 years of operation. We’d appreciate your feedback, even when it’s negative. Our doors are always open at firstname.lastname@example.org.