MemCachier Launches on Manifold

We’ve recently launched MemCachier on Manifold, a new company that aims to bring the developer app store to everyone rather than tying it to a single PaaS provider.

We are really excited by this partnership with Manifold and the mission they are on. Please check them out and give MemCachier a spin through them.

ASCII Protocol Support

We’re happy to announce the general availability of a new feature on MemCachier: the ability to connect to our servers using the memcache ASCII protocol! Our official documentation covers it here.

Previously, this wasn’t supported as the ASCII protocol provides no standardized way to do authentication. We added our own simple scheme. When you establish a connection, the first command you must send to us is a set with your username as the key and your password as the value. For example:

$ telnet 11211

> set 15F38e 0 0 32
> 5

The STORED line indicates successful authentication. After this, you can issue all the familiar memcache ASCII protocol commands! You will need to send the set command fairly quickly, though, as otherwise we time out the connection.
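As a rough sketch of the scheme described above, the authentication step can be expressed in a few lines of Python. The helper names, the 5 second timeout, and the host and port arguments are our own assumptions for illustration; only the shape of the set command comes from the memcache ASCII protocol.

```python
import socket


def build_auth_command(username: str, password: str) -> bytes:
    # An ASCII set has the form: set <key> <flags> <exptime> <bytes>,
    # terminated by CRLF, followed by the data block and another CRLF.
    # Here the key is the username and the data block is the password.
    value = password.encode("utf-8")
    header = f"set {username} 0 0 {len(value)}\r\n".encode("utf-8")
    return header + value + b"\r\n"


def authenticate(host: str, port: int, username: str, password: str,
                 timeout: float = 5.0) -> socket.socket:
    # Hypothetical helper: connect, then send the auth command right away,
    # since the server closes connections that wait too long.
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(build_auth_command(username, password))
        # Simplification: assumes the short reply arrives in a single recv.
        reply = sock.recv(1024)
        if not reply.startswith(b"STORED"):
            raise ConnectionError(f"authentication failed: {reply!r}")
    except Exception:
        sock.close()
        raise
    return sock
```

The returned socket can then be used for ordinary get and set commands. Keeping the command builder separate from the network code makes it easy to reuse when adapting an existing client.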

Be careful to run this only over the internal network of the IaaS provider you are hosting MemCachier with (e.g., AWS EC2). Why? Those networks are secure against eavesdropping and so the unencrypted connection works fine. If running over the Internet, you’ll want to use our TLS support for securing the connection.
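For connections that do cross the Internet, a minimal sketch of wrapping the socket in TLS with Python's standard library might look like the following. The helper names and the host and port arguments are hypothetical, and we are assuming the TLS endpoint expects the handshake immediately on connect; check the documentation for the actual endpoint details.

```python
import socket
import ssl


def make_tls_context() -> ssl.SSLContext:
    # create_default_context() turns on certificate and hostname verification.
    return ssl.create_default_context()


def open_tls_connection(host: str, port: int, timeout: float = 5.0) -> ssl.SSLSocket:
    # Connect over TCP, then complete the TLS handshake before sending
    # any memcache commands (including the authentication set).
    raw = socket.create_connection((host, port), timeout=timeout)
    try:
        return make_tls_context().wrap_socket(raw, server_hostname=host)
    except Exception:
        raw.close()
        raise
```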

Beyond running ad-hoc queries yourself using telnet or similar, supporting the ASCII protocol also opens up the number of clients and frameworks you can use. Just make sure to modify them to send the set command for authentication on first connection. As an example, we created a Django caching backend that uses the ASCII protocol but works with MemCachier!

Trouble in US-East – Postmortem

What happened?

Around 3pm PST on Monday, we started seeing very high connection counts to two proxy servers in one of the clusters in US-East. Normally, this correlates with either a poor network connection that causes lots of connection timeouts and re-connections from clients, or a handful of clients making an abnormal number of connections. In those cases we can simply replace an affected machine or reroute traffic to balance the load out.

Unfortunately, we were not able to attribute the drastic rise in connection count to either poor performance of a machine or heavy load from any particular subset of clients. We did see abnormal network behavior, with most latency falling in the usual ~0.3ms range between machines, but around 10-20% of requests were in the 5-15ms range. This, however, was across all machines.

Sporadically, the high number of connections moved from one proxy to another in the cluster. We replaced several of the proxy machines and managed to restore stable operation to that cluster within about 30 minutes.

However, shortly afterwards, two different clusters began exhibiting similar behaviors. These clusters are isolated: they don’t share any resources and are in different availability zones. Again, there were no obvious signs of issues other than some minor increases in network latency. Network bandwidth, CPU load, and so on were all normal, and we hadn’t deployed new software to these servers.

We eventually replaced nearly all of the proxy machines and some of the backend machines in US-East. We also increased the total number of proxies in each cluster.

In all, the incidents lasted until around 11pm PST on Wednesday morning. While service for most customers returned to normal within an hour, many of you were affected sporadically or continuously throughout the entire incident.

What caused this?

We believe it was two issues. First, the network started behaving abnormally: latency for a small subset of packets went from 0.3ms to 5-15ms, and a portion of packets were dropped. This was occurring across the board, not just to a few machines. Second, a number of the ways in which our system responds to failure actually amplified this particularly rare issue. For example, we log connection errors between servers. During a widespread outage like the one that occurred, this means millions of log entries. Worse, we log the struct that identifies a failed server, which includes a list of outstanding requests to that server. Similarly, we continuously record the connection count for each server using the lsof command, which isn’t particularly efficient. Typically, this isn’t a problem, but when connection counts are unusually high, it doesn’t help an already overloaded CPU trying frantically to free unused file descriptors.
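To illustrate the logging amplification, here is one possible mitigation sketched in Python: cap how often each failing server is logged, and log a count of outstanding requests rather than the whole list. This is an illustration only, not a description of our production code; the class and function names are made up for the example.

```python
import time


class RateLimitedLogger:
    """Emit at most one message per key per interval, counting the rest."""

    def __init__(self, interval_seconds: float, clock=time.monotonic, sink=print):
        self.interval = interval_seconds
        self.clock = clock
        self.sink = sink
        self._last_emit = {}   # key -> time of the last emitted message
        self._suppressed = {}  # key -> messages dropped since the last emit

    def log(self, key: str, message: str) -> bool:
        now = self.clock()
        last = self._last_emit.get(key)
        if last is not None and now - last < self.interval:
            # Too soon since the last message for this key: count it, drop it.
            self._suppressed[key] = self._suppressed.get(key, 0) + 1
            return False
        dropped = self._suppressed.pop(key, 0)
        suffix = f" ({dropped} similar suppressed)" if dropped else ""
        self.sink(f"{key}: {message}{suffix}")
        self._last_emit[key] = now
        return True


def summarize_failed_server(address: str, outstanding_requests: list) -> str:
    # Log how many requests were outstanding instead of dumping each one.
    return f"connection error to {address}, {len(outstanding_requests)} requests outstanding"
```

With a limiter like this, a server that fails a million times in a minute produces a handful of log lines instead of a million, so logging cost stays roughly constant during an outage.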

We also made human errors that hurt a subset of clients. In addition to launching new servers and migrating load, we added IP filtering rules to help mitigate the load temporarily. This helped, but meant that new connections from some legitimate clients were blocked.

What’s high availability?

Many of you reached out to support for clarification on what kind of availability our high-availability plans provide. High-availability today means you have access to more than one proxy in the same cluster. Should one fail, the other is available. Importantly, each proxy has the same view of the underlying cache, so a transient network failure or individual server crash can be masked by your client library without losing access to any portion of your cache. This works well in the vast majority of cases, where individual servers fail or experience high network latency. However, it doesn’t help in cases like this one, where entire clusters were affected.
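The masking behavior a client library provides can be sketched as follows. This is a simplified illustration, not the logic of any particular client: each entry in `proxies` is a hypothetical callable standing in for a connection to one proxy.

```python
def get_with_failover(proxies, key):
    """Try each proxy in order and return the first successful result.

    Because every proxy has the same view of the cache, any proxy that
    answers can serve the request.
    """
    last_error = None
    for fetch in proxies:
        try:
            return fetch(key)
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc  # this proxy is unreachable, try the next one
    raise ConnectionError("all proxies failed") from last_error
```

Note the limitation this makes visible: if every proxy in the list is in the same affected cluster, the loop runs out of candidates and the request fails.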

Moving forward we’re investigating if and how to offer plans that span multiple clusters. If you have suggestions, please email them to

What’s next?

First, we are completely revising how we do incident response. We are establishing codified procedures to step through for all incident types now, including cluster-wide failures like the one this week.

Second, we are fixing issues in our systems that amplified the underlying issue and reviewing other technical ways to prevent issues propagating through the cluster.

Third, we’ve normally tried to skew towards mitigating cache loss and resolving network issues as quickly as possible by re-balancing load and replacing individual machines in a cluster. While we will still do this for most incidents, we are putting in place a much more aggressive course of action to take when issues actually start affecting customers. In particular, starting today, we will skew towards resolving connectivity issues as fast as possible at the expense of flushing parts of the cache.

We’ve already deployed changes to internal tools that make this process quicker and more automated, and we’re working on additional changes to our monitoring infrastructure that will help identify cases like this one that require more drastic measures.

We are also improving our tooling around communication. This was the worst outage we’ve ever had, and resulted in a breakdown of how we communicate with customers. We’re going to integrate our status page, Twitter account and monitoring systems so that we can inform our customers of the ongoing issues much more quickly and seamlessly.

We’re Sorry

We are sorry for the outage and the impact it had on our customers. This is the worst one we’ve had in almost 5 years of operation. We’d appreciate your feedback, even when it’s negative. Our doors are always open at

US-East C1 Outage Postmortem: September 30th

Over the last couple of days the US-East-C1 cluster experienced some performance issues, the worst of which occurred on September 30th between 8:36am and 8:44am (PST), when latencies were bad enough that several customers’ apps were severely affected. We’re very sorry to all affected customers and want to explain what happened and what we’re doing to prevent such incidents in the future.

The Unreliability of SMS

Have you ever sent a friend an SMS and it took them longer than seems reasonable to respond? That’s basically what happened to us on September 30th.

For the most part, MemCachier is engineered to recover gracefully when one of the nodes in a cluster is performing very poorly. However, in some cases the support engineer on call needs to get involved: for example, when performance is worse than normal but still within what looks like regular network jitter, or when performance issues in one part of the cluster send a thundering herd of clients to another part of the cluster.

In these rare cases, the engineer’s ability to respond promptly is obviously very important.

The incident on September 30th was one of these rare cases. Two backend machines in the cluster started seeing very high latency, which affected several of the proxy servers that mediate between clients and those machines. Our monitoring systems picked up on the issue and alerted the engineer on call.

Unfortunately, the alert was sent over SMS and the engineer didn’t receive it for another 10 minutes. As a result, the incident lasted long enough, and escalated far enough before the engineer could resolve it, that customers were negatively affected.

Future Steps

We’re in the process of moving away from SMS for alerting our support engineers. We are switching to XMPP (aka Jabber, the same protocol used by messaging apps like WhatsApp). This will provide better reliability, faster delivery, and most importantly, allow our monitoring systems to know sooner whether the person on call was reached or the incident needs to be escalated.
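The escalation logic this enables can be sketched roughly as below. The `send` and `wait_for_ack` hooks are hypothetical placeholders for whatever the messaging transport provides, and the 60 second default is an arbitrary value for illustration.

```python
def alert_with_escalation(contacts, send, wait_for_ack, ack_timeout_seconds=60.0):
    """Alert each contact in turn until one acknowledges.

    send(contact) delivers the alert. wait_for_ack(contact, timeout) returns
    True if that contact acknowledged within the timeout. Returns the contact
    who acknowledged, or None if nobody did.
    """
    for contact in contacts:
        send(contact)
        if wait_for_ack(contact, ack_timeout_seconds):
            return contact
        # No acknowledgement in time: escalate to the next contact.
    return None
```

The key difference from SMS is the acknowledgement step: the monitoring system learns within a bounded time whether anyone was actually reached.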

Once again, we’re very sorry to all affected customers.

Django persistent memcached connections

Recently we diagnosed an issue for several of our customers who were using the Python Django web framework. They were all experiencing performance issues and a fair number of broken connections to our servers. The issue was that by default Django uses a new connection to memcached for each request. This is terrible for performance in general, paying the cost of an extra round-trip to set up the TCP connection, but is far worse with cloud services like MemCachier that have a security layer and require new connections to be authenticated. A new connection to a MemCachier server takes at least 3 round-trips, which increases the time to execute a single get request by 4x.
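The 4x figure follows from simple counting, worked through here under the assumption that round-trip time dominates and every round-trip costs the same:

```python
def request_round_trips(setup_round_trips: int, persistent: bool) -> int:
    """Round-trips for one get: connection setup (if any) plus the get itself."""
    return (0 if persistent else setup_round_trips) + 1


fresh = request_round_trips(3, persistent=False)  # 3 for setup + 1 for the get
reused = request_round_trips(3, persistent=True)  # only the get
slowdown = fresh / reused                         # 4 / 1
```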

It turned out this situation is the result of a very old and long-standing bug in Django (#11331). This bug arose from an incorrect fix for another bug that Django had, one where it could continually create new connections to the memcached server without ever closing any of them until it eventually consumed all the available file descriptors at the server. Unfortunately, the fix for this file descriptor leak was to simply close the memcached connection after every request.

Thankfully we can fix this by monkey-patching Django and disabling the close method of the Memcached class. The best place to do this is in your Django file, inserting the code below before you create your application:
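A minimal sketch of the monkey-patching technique is shown here on a stand-in class so that it runs without Django installed. In a real project the class to patch would be Django's memcached backend base class; in the Django releases we are aware of that is django.core.cache.backends.memcached.BaseMemcachedCache, but treat that path as an assumption and check it against your Django version. The names `FakeMemcachedCache` and `disable_close` are made up for this example.

```python
class FakeMemcachedCache:
    """Stand-in for Django's memcached cache backend (hypothetical)."""

    def __init__(self):
        self.connected = True

    def close(self, **kwargs):
        # Mirrors the problematic behavior: drop the connection when called.
        self.connected = False


def disable_close(cache_class):
    # Replace close() with a no-op so the connection survives across requests.
    cache_class.close = lambda self, **kwargs: None
    return cache_class


disable_close(FakeMemcachedCache)
```

After the patch, calling close() on an instance leaves the connection open. The patch needs to run once at startup, before any requests are served.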

That is, the full file may look something like the following:

We’ve advised several customers to use this fix and they’ve had great results! We will soon be adding it to our docs as a recommended practice for Django.