Some of our customers in Amazon’s us-east-1 region (Virginia) noticed degraded performance and partial outages from Friday, 11/16 through Sunday, 11/18, and again on Wednesday, 11/21. We want to take this opportunity to explain what happened.
Friday, 11/16 through Sunday, 11/18
On Friday we started noticing what looked like a DDoS attack. Two of our machines would be totally fine one second, then be hit with 16,000 TCP connections the next. The spike caused new TCP connections to be rejected, which degraded performance for customers who attempted to open new connections at that time. Most memcache clients use persistent TCP connections, so most of our customers didn’t experience downtime; however, because the cluster was so overloaded, some customers saw slower performance. The load largely hit our proxy servers (the servers customers actually connect to), so only a subset of customers were affected while most of the cluster continued operating normally.
By Friday evening the attacks had subsided and we were able to breathe a little and take a close look at what was happening. Upon inspection, we noticed that the majority of these 16,000 TCP connections were in the CLOSE_WAIT state, meaning the client had closed the connection but our server was still holding on to the file descriptor. Digging further, we found that the degraded performance was causing memcache clients to time out and retry, with each retry opening a new TCP connection. This snowballed and resulted in an effective DDoS attack from our own customers. The degraded performance itself was triggered by a large customer who experienced a massive traffic spike.
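For readers curious how to spot this symptom on their own Linux servers, here is a small sketch (illustrative only, not part of our production tooling) that counts CLOSE_WAIT sockets by parsing /proc/net/tcp, where the kernel records TCP state 08 for CLOSE_WAIT:

```python
def count_close_wait(path="/proc/net/tcp"):
    """Count IPv4 sockets in the CLOSE_WAIT state (kernel state code 08)."""
    count = 0
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            # Column 3 ("st") holds the TCP state as hex; 08 == CLOSE_WAIT.
            if len(fields) > 3 and fields[3] == "08":
                count += 1
    return count

if __name__ == "__main__":
    print("CLOSE_WAIT sockets:", count_close_wait())
```

A machine that is healthy one moment and buried in CLOSE_WAIT sockets the next is a strong hint that clients are abandoning connections faster than the server is releasing file descriptors.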
We have implemented a few measures to help prevent one customer from degrading other customers’ performance. We’ve also booted several more high-CPU, high-IO servers to help spread the load, and we’ve improved our status page to be more informative and useful, with more changes coming soon.
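To illustrate the general shape of such a measure, here is a hypothetical sketch of per-customer rate limiting with a token bucket (this is not our actual implementation; the names and numbers are made up): each customer’s bucket refills at a steady rate, and requests that find it empty are throttled rather than allowed to starve everyone else.

```python
import time

class TokenBucket:
    """Per-customer token bucket: bursts up to `capacity`, refilled at `rate`/sec."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per customer: short bursts are absorbed,
# sustained floods are throttled before they hit shared capacity.
buckets = {"customer-a": TokenBucket(rate=100, capacity=200)}
```

The design choice here is that a traffic spike from one customer exhausts only that customer’s bucket, so the blast radius of a spike like Friday’s stays contained.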
Wednesday, 11/21

This outage was caused by a misbehaving machine. We had recently turned on a new machine (mc6.ec2.memcachier.com) to expand the cluster. Part of our provisioning process is running tests on the new machine to ensure that it’s operating normally. However, our tests didn’t catch an issue with mc6: it had an abnormally slow network connection, which caused the rest of the cluster to respond slowly as well. mc6 contains data nodes, meaning other servers such as mc1 make requests to it to retrieve data. Because it sits at the lowest level of a (shallow) distributed architecture, its slow response times rippled through the rest of the cluster. The slowness caused many memcache clients to time out, forcing new TCP connections, which overwhelmed all of our servers. The end result was very similar to Friday, but worse in that the whole cluster was affected for a short while.
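One gap this incident exposed is that our provisioning tests exercised the machine itself but not its network quality. As a hedged sketch of the kind of check that could catch a node like mc6 (hypothetical code, not our actual provisioning suite), one could measure the median TCP connect latency to a new node and flag it before it joins the cluster:

```python
import socket
import statistics
import time

def connect_latency(host, port, samples=5, timeout=1.0):
    """Median TCP connect time to (host, port); failures count as `timeout`."""
    times = []
    for _ in range(samples):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                times.append(time.monotonic() - start)
        except OSError:
            times.append(timeout)  # treat failures as worst-case latency
    return statistics.median(times)

def node_looks_healthy(host, port, threshold=0.05):
    """Flag a node whose median connect latency exceeds `threshold` seconds."""
    return connect_latency(host, port) < threshold
```

The threshold here is arbitrary; in practice it would be calibrated against latencies observed from known-good nodes in the same datacenter.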
We removed the misbehaving machine and the cluster immediately returned to normal operation.
We’re deeply sorry for the outages, and we’re grateful for your patience. We understand our customers place a lot of trust in us: if we go down, often they go down, too. We take this responsibility very seriously.
We worked very hard over the weekend and Thanksgiving week to manage these problems and learn from them. We have several changes coming that will improve the performance and reliability of our service, allowing us to grow to the next level. We will announce them shortly once we start deploying them. Finally, we’ll continue to do everything we can to offer you the best memcache service available.