On the 12th of March, between 2am and 11am PST, MemCachier suffered a series of outages and performance problems affecting many of our production customers. This post briefly summarizes the cause of the problem and the actions we have taken since to address it.
First, at around 2:11am PST, one of our largest clusters saw a surge in demand for memory and computing resources, largely caused by two fairly large customers signing up and using their caches for the first time. While this should have been fine, a previously unknown timing bug existed in the communication between the cluster and the provisioner. The provisioner is the server that collects resource usage statistics from the cluster, decides when to provision new machines, and makes some simple scheduling and migration decisions to distribute demand efficiently. The bug stemmed from the interaction between needing to provision new machines to handle the extra demand while, simultaneously, one of our virtual machines was failing (with greatly degraded performance) due to issues with Amazon EC2 that were outside our control.
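To give a feel for how a degraded machine can confuse this kind of decision, here is a deliberately naive sketch (not our actual provisioner logic): a scaler that acts on average memory pressure can be dragged the wrong way when one machine reports misleading statistics.

```python
# Purely illustrative, not MemCachier's code: a naive provisioner that
# scales on average memory pressure across the cluster.

def should_add_machine(usage_by_machine, threshold=0.8):
    """usage_by_machine: fraction of memory in use per machine (0.0-1.0).

    Returns True when the cluster-wide average exceeds the threshold.
    A failing machine that reports near-zero usage lowers the average
    and can suppress scaling exactly when demand is surging.
    """
    avg = sum(usage_by_machine) / len(usage_by_machine)
    return avg > threshold
```

In this toy model, three healthy machines at 90% usage trigger scaling, but if one of them is degraded and reports 0% usage, the average drops below the threshold and no machine is added, even though real demand is unchanged.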
The end result was that the provisioner got into a confused state and made the incorrect decision not to bring new machines into the cluster quickly enough. As a result, a few machines had processes on them that ran out of memory and restarted. This is completely unacceptable, but in itself it would not have had a large impact on many of our customers, as most of them correctly treat MemCachier as a cache and can withstand data loss. However, as is generally the case with such systems and outages, this behaviour triggered a recently introduced issue elsewhere in the system.
The second, follow-on issue was that the then-deployed version of the MemCachier proxy layer (the layer that manages communication from each front-end server to all backends) didn't handle restarted backends correctly. It normally detects broken sockets and removes them from its list of known servers, but the latest version had introduced a bug in that code, so broken sockets to now non-existent backend processes stuck around. This meant a request would sometimes silently fail because the proxy was trying to send it over a broken socket. This second issue caused the bulk of the problems customers experienced, as our monitoring infrastructure wasn't well enough equipped to detect it. Whether a request worked or not depended on the key, since the key determines which backend node is chosen. Our range of tested keys picked up some of these broken connections but not all of them, and the silent-error behaviour muddled things further.
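To illustrate why the failure was key-dependent, here is a minimal sketch (not our actual proxy code, and the `Backend` objects are hypothetical): a proxy typically hashes each key to pick a backend, so a stale socket breaks only the slice of the keyspace that hashes to it, and the fix on error is to drop the dead backend and retry.

```python
import hashlib

class Proxy:
    """Toy proxy: keys map to backends by hash, so one dead backend
    breaks only the keys that hash to it."""

    def __init__(self, backends):
        self.backends = list(backends)  # hypothetical backend objects

    def backend_for(self, key):
        # Which backend serves a key depends on the hash of the key.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.backends[h % len(self.backends)]

    def send(self, key, request):
        backend = self.backend_for(key)
        try:
            return backend.send(request)
        except OSError:
            # The missing step in the buggy version: without removing
            # the dead backend here, every future request for keys that
            # hash to it keeps hitting the broken socket.
            self.backends.remove(backend)
            return self.send(key, request)  # retry on the remaining set
```

This also shows why a fixed set of monitoring keys can miss the problem: only keys hashing to the dead backend fail, so a test set that happens to avoid that slice of the keyspace reports everything healthy.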
The result of all of this was the most serious problem we have faced yet, and the first one that resulted purely from our own mistakes rather than from problems flowing up from the hardware and network layers. We would like to apologise deeply to all our customers. This is unacceptable and we are ashamed of it.
We have taken a number of actions since the problem occurred to improve our processes to ensure such issues don’t occur in the future. These include:
- The entire code base has been reviewed and improved, with particular attention to how components interconnect. We have greatly improved our testing of these interactions and documented the various invariants and properties we rely on.
- We’ve spent the last two weeks improving our testing infrastructure and monitoring systems. A large part of this is the ability to replay recorded (real) production data, as well as synthesized data that models various scenarios.
- We’ve formalized our release process. While one previously existed, it had evolved over time and the culture around it wasn’t careful enough. Various checkpoints and reviews now exist to ensure no such bugs make it into future production releases.
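As a rough illustration of the replay idea mentioned above (a sketch, not our actual harness, and the client interface is assumed): recorded operations can be reapplied against a test deployment and the responses compared to what was originally observed.

```python
# Hypothetical replay harness: feed recorded (op, key, value) tuples to
# any client exposing get/set, and collect mismatched responses.

def replay(ops, client):
    """ops: list of ("set", key, value) or ("get", key, expected) tuples.

    Returns a list of (index, key, expected, got) for every "get" whose
    response differs from the recorded one.
    """
    failures = []
    for i, (op, key, value) in enumerate(ops):
        if op == "set":
            client.set(key, value)
        elif op == "get":
            got = client.get(key)
            if got != value:  # value holds the recorded response
                failures.append((i, key, value, got))
    return failures
```

Run against a healthy deployment, a recorded trace should replay with an empty failure list; injected faults (such as a restarted backend) should show up as mismatches on the affected keys.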
We have made some great changes in the last two weeks, but these are just the beginning of a process of continual improvement to our code, culture and practices. We are happy and grateful to have so many wonderful users. Thank you.