Trouble in US-East – Postmortem

What happened?

Around 3pm PST on Monday, we started seeing very high connection counts to two proxy servers in one of the clusters in US-East. Normally, this correlates with either a poor network connection that causes lots of connection timeouts and re-connections from clients, or a handful of clients making an abnormal number of connections. In those cases we can simply replace the affected machine or reroute traffic to balance out the load.

Unfortunately, we were not able to attribute the drastic rise in connection count to either poor performance of a machine or heavy load from any particular subset of clients. We did see abnormal network behavior: most inter-machine latency fell in the usual ~0.3ms range, but around 10-20% of requests took 5-15ms. This, however, was happening across all machines.

Sporadically, the high number of connections moved from one proxy to another in the cluster. We replaced several of the proxy machines and managed to restore stable operation to that cluster within about 30 minutes.

However, shortly afterwards, two different clusters began exhibiting similar behavior. These clusters are isolated: they don’t share any resources and are in different availability zones. Again, there were no obvious signs of issues other than some minor increases in network latency. Network bandwidth, CPU load, and so on were all normal, and we hadn’t deployed any new software to these servers.

We eventually replaced nearly all of the proxy machines and some of the backend machines in US-East. We also increased the total number of proxies in each cluster.

In all, the incident lasted until around 11pm PST on Wednesday. While service for most customers returned to normal within an hour, many of you were affected sporadically or continuously throughout the entire incident.

What caused this?

We believe there were two issues. First, the network started behaving abnormally: latency for a small subset of packets went from 0.3ms to 5-15ms, and a portion of packets were dropped. This was occurring across the board, not just to a few machines. Second, a number of the ways in which our system responds to failure actually amplified this particularly rare issue. For example, we log connection errors between servers. During a widespread outage like this one, that means millions of log entries. Worse, we log the struct that identifies a failed server, which includes a list of outstanding requests to that server. Similarly, we continuously record the connection count for each server using the lsof command, which isn’t particularly efficient. Typically this isn’t a problem, but when connection counts are unusually high it doesn’t help an already overloaded CPU frantically trying to free unused file descriptors.

We also made human errors that hurt a subset of clients. In addition to launching new servers and migrating load, we added IP filtering rules to help mitigate the load temporarily. This helped, but meant that new connections from some legitimate clients were blocked.

What’s high availability?

Many of you reached out to support for clarification on what kind of availability our high-availability plans provide. High availability today means you have access to more than one proxy in the same cluster. Should one fail, the other is available. Importantly, each proxy has the same view of the underlying cache, so a transient network failure or an individual server crash can be masked by your client library without losing access to any portion of your cache. This works well in the vast majority of cases, where individual servers fail or experience high network latency. However, it doesn’t help in cases like this one, where entire clusters were affected.

Moving forward, we’re investigating whether and how to offer plans that span multiple clusters. If you have suggestions, please email them to support@memcachier.com.

What’s next?

First, we are completely revising how we do incident response. We are now establishing codified procedures to step through for each incident type, including cluster-wide failures like the one this week.

Second, we are fixing the issues in our systems that amplified the underlying problem and reviewing other technical ways to prevent issues from propagating through a cluster.

Third, we’ve normally tried to skew towards mitigating cache loss and resolving network issues as quickly as possible by re-balancing load and replacing individual machines in a cluster. While we will still do this for most incidents, we are putting in place a much more aggressive course of action to take when issues actually start affecting customers. In particular, starting today, we will skew towards resolving connectivity issues as fast as possible at the expense of flushing parts of the cache.

We’ve already deployed changes to internal tools that make this process quicker and more automated, and we’re working on additional changes to our monitoring infrastructure that will help identify cases like this one that require more drastic measures.

We are also improving our tooling around communication. This was the worst outage we’ve ever had, and resulted in a breakdown of how we communicate with customers. We’re going to integrate our status page, Twitter account and monitoring systems so that we can inform our customers of the ongoing issues much more quickly and seamlessly.

We’re Sorry

We are sorry for the outage and the impact it had on our customers. This is the worst one we’ve had in almost 5 years of operation. We’d appreciate your feedback, even when it’s negative. Our doors are always open at support@memcachier.com.

US-East C1 Outage Postmortem: September 30th

Over the last couple of days the US-East-C1 cluster experienced some performance issues, the worst of which occurred on September 30th between 8:36am and 8:44am (PST), when latencies were bad enough that several customers’ apps were severely affected. We’re very sorry to all affected customers and want to explain what happened and what we’re doing to prevent such incidents in the future.

The Unreliability of SMS

Have you ever sent a friend an SMS and it took them longer than seemed reasonable to respond? That’s basically what happened to us on September 30th.

For the most part, MemCachier is engineered to recover gracefully when one of the nodes in a cluster is performing very poorly. However, in some cases the support engineer on call needs to get involved, for example when performance is worse than normal but still within what looks like regular network jitter, or when performance issues in one part of the cluster lead to a thundering herd of clients hitting another part of the cluster.

In these rare cases, the engineer’s ability to respond promptly is obviously very important.

The incident on September 30th was one of these rare cases. Two backend machines in the cluster started seeing very high latency, which affected the several proxy servers that mediate between clients and those machines. Our monitoring systems picked up on the issue and alerted the engineer on call.

Unfortunately, the alert went out over SMS and the engineer didn’t receive it for another 10 minutes. This meant the incident lasted long enough, and escalated enough in impact before the engineer could resolve it, that customers were negatively affected.

Future Steps

We’re in the process of moving away from SMS for alerting our support engineers. We are switching to XMPP (aka Jabber, the same protocol used by messaging apps like WhatsApp). This will provide better reliability, faster delivery, and, most importantly, allow our monitoring systems to know sooner whether the person on call was reached or the incident needs to be escalated.

Once again, we’re very sorry to all affected customers.

Django persistent memcached connections

Recently we diagnosed an issue for several of our customers who were using the Python Django web framework. They were all experiencing performance issues and a fair number of broken connections to our servers. The issue was that, by default, Django opens a new connection to memcached for each request. This is bad for performance in general, paying the cost of an extra round-trip to set up the TCP connection, but it is far worse with cloud services like MemCachier that have a security layer and require new connections to be authenticated. A new connection to a MemCachier server takes at least 3 round-trips, which increases the time to execute a single get request by 4x.

It turned out this situation is the result of a very old and long-standing bug in Django (#11331). This bug arose from an incorrect fix for another bug Django had, one where it could continually create new connections to the memcached server without ever closing any of them, until it eventually consumed all the available file descriptors at the server. Unfortunately, the fix for this file descriptor leak was simply to close the memcached connection at the end of every request.

Thankfully, we can fix this by monkey-patching Django and disabling the close method of its memcached cache backend class. The best place to do this is in your Django wsgi.py file, inserting the code below before you create your application:
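A minimal sketch of that patch, assuming your configured cache backend derives from Django’s BaseMemcachedCache, looks like this:

# Monkey-patch: stop Django from closing the memcached connection
# at the end of every request (works around Django bug #11331).
from django.core.cache.backends.memcached import BaseMemcachedCache
BaseMemcachedCache.close = lambda self, **kwargs: None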

That is, the full file may look something like the following:
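Here is a rough sketch of a patched wsgi.py (the settings module name "yourproject.settings" is just a placeholder for your own project):

import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "yourproject.settings")

# Monkey-patch: stop Django from closing the memcached connection
# at the end of every request (works around Django bug #11331).
from django.core.cache.backends.memcached import BaseMemcachedCache
BaseMemcachedCache.close = lambda self, **kwargs: None

from django.core.wsgi import get_wsgi_application
application = get_wsgi_application()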

We’ve advised several customers to use this fix and they’ve had great results! We will soon be adding it to our docs as a recommended practice for Django.

Ubuntu 14.04, libmemcached and SASL support

Any Ubuntu 14.04 LTS users may have noticed that the provided libmemcached package doesn’t support SASL authentication. This is a major issue, as it means that any memcache client that depends on libmemcached (and there are a lot of them!) doesn’t work out of the box with MemCachier or similar services, since it can’t authenticate with our servers.

We are trying to do two things to address this:
1) Asking Ubuntu to build with SASL support (bug report)
2) Submitting a patch to libmemcached to remove the libsasl2 dependency and implement SASL directly, so that it is always built with SASL support (patch)

UPDATE (Jan 2016): As of Ubuntu 15.10, libmemcached is built with SASL support. We suggest all users update to Ubuntu 16.04 LTS. Sadly, we haven’t made any progress on (2), as the libmemcached maintainer is very unresponsive.

Until either of those works out, though, Ubuntu users will need to compile libmemcached themselves to get SASL support. Luckily this is very easy, and we’ll show you how.

First, remove the existing libmemcached install if present:

$ sudo apt-get remove libmemcached-dev
$ sudo apt-get autoremove

Now, make sure libsasl2 and the build tools you’ll need are installed:

$ sudo apt-get install libsasl2-dev build-essential

Then, grab the latest source code for libmemcached and build it:

$ wget https://launchpad.net/libmemcached/1.0/1.0.18/+download/libmemcached-1.0.18.tar.gz
$ tar xzf libmemcached-1.0.18.tar.gz
$ cd libmemcached-1.0.18
$ ./configure --enable-sasl
$ make
$ sudo make install

This will install into /usr/local. If you want, you could install to /opt or similar instead, but then you’ll need to tell any software that depends on libmemcached about the custom location, so we recommend just sticking with the default.
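If you do decide on a custom prefix, a rough sketch of what that involves looks like the following (the paths and the my_app program name are purely illustrative):

$ ./configure --enable-sasl --prefix=/opt/libmemcached
$ make
$ sudo make install
# point the compiler/linker and the runtime at the custom prefix:
$ gcc -o my_app my_app.c -I/opt/libmemcached/include -L/opt/libmemcached/lib -lmemcached -pthread
$ LD_LIBRARY_PATH=/opt/libmemcached/lib ./my_app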

You can now test this by compiling and trying to run the following C code:
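A small test program along these lines does the job. It’s a minimal sketch: it assumes the server listens on the default memcached port (11211) and simply performs an authenticated set and get, printing an error if any step fails.

/* sasl_test.c: check that libmemcached can authenticate over SASL. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libmemcached/memcached.h>

int main(int argc, char *argv[])
{
  if (argc != 4) {
    fprintf(stderr, "usage: %s <username> <password> <server>\n", argv[0]);
    return 1;
  }

  memcached_st *mc = memcached_create(NULL);
  memcached_return_t rc;

  /* SASL authentication requires the binary protocol. */
  memcached_behavior_set(mc, MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 1);

  rc = memcached_set_sasl_auth_data(mc, argv[1], argv[2]);
  if (rc != MEMCACHED_SUCCESS) {
    fprintf(stderr, "setting SASL credentials failed: %s\n", memcached_strerror(mc, rc));
    return 1;
  }

  rc = memcached_server_add(mc, argv[3], 11211);
  if (rc != MEMCACHED_SUCCESS) {
    fprintf(stderr, "adding server failed: %s\n", memcached_strerror(mc, rc));
    return 1;
  }

  /* A simple set/get round-trip exercises authentication. */
  rc = memcached_set(mc, "sasl_test", strlen("sasl_test"), "hello", strlen("hello"), 0, 0);
  if (rc != MEMCACHED_SUCCESS) {
    fprintf(stderr, "set failed: %s\n", memcached_strerror(mc, rc));
    return 1;
  }

  size_t value_len;
  uint32_t flags;
  char *value = memcached_get(mc, "sasl_test", strlen("sasl_test"), &value_len, &flags, &rc);
  if (rc != MEMCACHED_SUCCESS || value == NULL) {
    fprintf(stderr, "get failed: %s\n", memcached_strerror(mc, rc));
    return 1;
  }

  printf("success: got back \"%.*s\"\n", (int)value_len, value);
  free(value);
  memcached_free(mc);
  return 0;
}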

Save it to a file (say sasl_test.c) and compile it as follows:

$ gcc -o sasl_test -O2 sasl_test.c -lmemcached -pthread

Then test that it works by simply creating a free development cache at MemCachier and using the provided username, password and server:

$ ./sasl_test user-233 password-233 mc5.dev.ec2.memcachier.com

It should work if libmemcached is set up correctly, and if not, it’ll print out where it went wrong.

Launching multiple username caches!

We’re happy to announce that we’re rolling out the first iteration of a new feature: multiple usernames for a single cache! This feature allows you to generate any number of new usernames and passwords that work against your current cache. They can later be deleted at any time, removing access to your cache through that username and password pair. This allows you to have different security zones use different passwords and to rotate your username and password if you feel any of your credentials are compromised. As multiple credentials can be active at once, this allows you to rotate credentials with no downtime.

This is also just the first iteration. Right now one limitation is that you can’t rotate the credentials that your cache was first created with. We’ll address this shortly, but we wanted to roll the feature out sooner and get feedback in the meantime. We also plan to offer read-only credentials in the future, and potentially namespaced credentials.

To use this feature, you’ll first need to generate extra credentials. This is all controlled through your cache’s analytics dashboard.

1. Log in to your analytics dashboard

Once there, you should now see two tabs just under the MemCachier top navigation banner. Click on the ‘Settings’ tab.

[Screenshot: the ‘Settings’ tab in the analytics dashboard]

2. Create a new username + password pair

The settings page should have your core cache credentials, as well as a new table of any extra credentials you have that work for your cache. For the cache below there are two extra username + password pairs.

[Screenshot: the Settings page showing the cache’s credentials, including two extra username + password pairs]

Clicking the ‘Add new credentials’ button will create a new username + password pair; in the example below, this creates the new username ‘dea9b6’.

[Screenshot: the credentials table after adding the new username ‘dea9b6’]

New credentials take up to 3 minutes to synchronize; after that they should be usable.
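As a quick illustration of zero-downtime rotation, you might point your app at the newly created credentials, redeploy, confirm everything connects, and only then delete the old pair. A minimal sketch using the python-binary-memcached client (the server address and password below are placeholders):

import bmemcached

# Old and new credential pairs are both active during a rotation,
# so clients using either pair keep working while you redeploy.
client = bmemcached.Client(('YOUR_MEMCACHIER_SERVER:11211',),
                           username='dea9b6',
                           password='the-generated-password')
client.set('rotation_check', 'ok')
print(client.get('rotation_check'))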

3. Removing a username + password pair

To remove a username + password pair, simply click the ‘X’ button next to that pair. The image below shows the removal of the username ‘069d1b’.

[Screenshot: removing the username ‘069d1b’ from the credentials table]

Enjoy!

That’s it! Please try it out and let us know what you think and where you’d like to see this go. As always, simply email us at support@memcachier.com, comment on this blog post, or reach us on Twitter (@memcachier).