US-East C1 Outage Postmortem: September 30th

Over the last couple of days the US-East-C1 cluster experienced performance issues, the worst of which occurred on September 30th between 8:36am and 8:44am (PST), when latencies were bad enough that several customers’ apps were severely affected. We’re very sorry to all affected customers and want to explain what happened and what we’re doing to prevent such incidents in the future.

The Unreliability of SMS

Have you ever sent a friend an SMS and it took them longer than seems reasonable to respond? That’s basically what happened to us on September 30th.

For the most part, MemCachier is engineered to recover gracefully when one of the nodes in a cluster is performing very poorly. However, in some cases the support engineer on call needs to get involved, for example when performance is worse than normal but still within what looks like regular network jitter, or when performance issues in one part of the cluster send a thundering herd of clients to another part of the cluster.

In these rare cases, the engineer’s ability to respond promptly is obviously very important.

The incident on September 30th was one of these rare cases. Two backend machines in the cluster started seeing very high latency, which affected several of the proxy servers that mediate between clients and those machines. Our monitoring systems picked up on the issue and alerted the engineer on call.

Unfortunately, the alert was sent over SMS and the engineer didn’t receive it for another 10 minutes. As a result, the incident lasted long enough, and escalated far enough, that customers were negatively affected before the engineer could resolve it.

Future Steps

We’re in the process of moving away from SMS for alerting our support engineers. We are switching to XMPP (aka Jabber, the same protocol used by messaging apps like WhatsApp). This will provide better reliability and faster delivery and, most importantly, let our monitoring systems know sooner whether the person on call was reached or the incident needs to be escalated.
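The "was the person reached, or do we escalate?" idea can be sketched in a few lines. This is only an illustration of the logic, not a description of any real alerting system: `send_alert`, `wait_for_ack`, the contact list and the timeout are all hypothetical stand-ins for whatever the transport (XMPP or otherwise) provides.

```python
from typing import Callable, Optional, Sequence


def page_on_call(
    contacts: Sequence[str],
    send_alert: Callable[[str, str], None],
    wait_for_ack: Callable[[str, float], bool],
    message: str,
    ack_timeout_s: float = 60.0,
) -> Optional[str]:
    """Alert each contact in order until one acknowledges.

    Returns the contact who acknowledged, or None if nobody did.
    All callables are hypothetical hooks supplied by the caller.
    """
    for contact in contacts:
        send_alert(contact, message)
        # No acknowledgement within the timeout: escalate to the next contact.
        if wait_for_ack(contact, ack_timeout_s):
            return contact
    return None
```

The point of the design is that delivery is confirmed rather than assumed: a missing acknowledgement triggers escalation within a bounded time instead of going unnoticed.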

Once again, we’re very sorry to all affected customers.

Django persistent memcached connections

Recently we diagnosed an issue for several of our customers who were using the Python Django web framework. They were all experiencing performance issues and a fair number of broken connections to our servers. The issue was that by default Django uses a new connection to memcached for each request. This is bad for performance in general, since every request pays for an extra round-trip to set up the TCP connection, but it is far worse with cloud services like MemCachier that have a security layer and require new connections to be authenticated. A new connection to a MemCachier server takes at least 3 round-trips, which increases the time to execute a single get request by 4x.
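The 4x figure comes from counting round-trips. A back-of-the-envelope sketch, assuming a get on an already-open connection costs one round-trip; the round-trip time used here is made up purely for illustration:

```python
def request_round_trips(setup_round_trips: int, persistent: bool) -> int:
    """Round-trips for one get: connection setup (if any) plus the get itself."""
    get_round_trips = 1
    return get_round_trips if persistent else setup_round_trips + get_round_trips


RTT_MS = 0.5  # hypothetical round-trip time, for illustration only

with_reuse = request_round_trips(3, persistent=True)      # 1 round-trip
without_reuse = request_round_trips(3, persistent=False)  # 3 + 1 = 4 round-trips
slowdown = without_reuse / with_reuse                     # 4.0

print(f"persistent: {with_reuse * RTT_MS} ms, new connection: {without_reuse * RTT_MS} ms")
print(f"slowdown: {slowdown}x")
```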

It turned out this situation is the result of a very old and long-standing bug in Django (#11331). This bug arose from an incorrect fix for another bug in Django, one where it could continually create new connections to the memcached server without ever closing any of them, until it eventually consumed all the available file descriptors at the server. Unfortunately, the fix for this file descriptor leak was to simply close the memcached connection after every request.

Thankfully we can fix this by monkey-patching Django and disabling the close method of the Memcached class. The best place to do this is in your Django wsgi.py file, inserting the code below before you create your application:

That is, the full file may look something like the following:
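A minimal sketch of such a file, under a few assumptions: that the class to patch is `BaseMemcachedCache` in `django.core.cache.backends.memcached`, that your settings module is called `myproject.settings` (a placeholder), and that your Django version still closes the connection after each request. The helper names `disable_close` and `make_application` are ours for illustration.

```python
import os


def disable_close(cache_cls):
    """Replace cache_cls.close with a no-op so the connection outlives each request."""
    cache_cls.close = lambda self, **kwargs: None
    return cache_cls


def make_application(settings_module="myproject.settings"):
    """Patch Django's memcached backend, then create the WSGI application.

    "myproject.settings" is a placeholder for your own settings module.
    """
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)

    # Imported here so the patch is applied before the application is created.
    from django.core.cache.backends.memcached import BaseMemcachedCache
    from django.core.wsgi import get_wsgi_application

    disable_close(BaseMemcachedCache)
    return get_wsgi_application()
```

In wsgi.py itself the module would then end with `application = make_application()`, so the WSGI server finds the usual `application` object.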

We’ve advised several customers to use this fix and they’ve had great results! We will soon be adding it to our docs as a recommended practice for Django.

Ubuntu 14.04, libmemcached and SASL support

Any Ubuntu 14.04 LTS users may have noticed that the provided libmemcached package doesn’t support SASL authentication. This is a major issue as it means that any memcache client that depends on libmemcached (which is a lot!) doesn’t work with MemCachier or similar services out-of-the-box as they can’t authenticate with our servers.

We are trying to do two things to address this:
1) Asking Ubuntu to build with SASL support (bug report)
2) Submitting a patch to libmemcached to remove the libsasl2 dependency and implement SASL directly so that it is always built with SASL support (patch)

UPDATE (Jan, 2016): As of Ubuntu 15.10, libmemcached is built with SASL support. We suggest all users update to Ubuntu 16.04 LTS. Sadly, we haven’t made any progress on (2), as the libmemcached maintainer is very unresponsive.

Until either of those two work out though, Ubuntu users will need to compile libmemcached themselves to get SASL support. Luckily this is very easy and we’ll show you how.

First, remove the existing libmemcached install if present:

$ sudo apt-get remove libmemcached-dev
$ sudo apt-get autoremove

Now, make sure libsasl2 and the build tools you’ll need are installed:

$ sudo apt-get install libsasl2-dev build-essential

Then, grab the latest source code for libmemcached and build it:

$ wget https://launchpad.net/libmemcached/1.0/1.0.18/+download/libmemcached-1.0.18.tar.gz
$ tar xzf libmemcached-1.0.18.tar.gz
$ cd libmemcached-1.0.18
$ ./configure --enable-sasl
$ make
$ sudo make install

This installs to /usr/local by default. You could install to /opt or similar instead, but then you’ll need to tell any software that depends on libmemcached about the custom location, so we recommend sticking with the default.
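One quick way to see whether the freshly installed library is visible to the dynamic linker is to ask for it by name. The sketch below uses Python's standard `ctypes.util.find_library` and assumes the library is registered under the name `memcached`. On Linux, a library newly installed under /usr/local/lib may not show up until the linker cache has been refreshed (for example with `sudo ldconfig`).

```python
import ctypes.util
from typing import Optional


def locate_libmemcached() -> Optional[str]:
    """Return the name the linker resolves for libmemcached, or None if not found."""
    return ctypes.util.find_library("memcached")


if __name__ == "__main__":
    found = locate_libmemcached()
    if found is None:
        print("libmemcached not found by the dynamic linker")
    else:
        print(f"libmemcached found: {found}")
```

This only shows that the library can be found; it says nothing about whether SASL support was compiled in.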

You can now test this by compiling and trying to run the following C code:

Save it to a file (say sasl_test.c) and compile it as follows:

$ gcc -o sasl_test -O2 sasl_test.c -lmemcached -pthread

Then test that it works by simply creating a free development cache at MemCachier and using the provided username, password and server:

$ ./sasl_test user-233 password-233 mc5.dev.ec2.memcachier.com

It should work if libmemcached is set up correctly, and if not, it’ll print out where it went wrong.

Launching multiple username caches!

We’re happy to announce that we’re rolling out the first iteration of a new feature: multiple usernames for a single cache! This feature allows you to generate any number of new usernames and passwords that work against your current cache. They can later be deleted at any time, removing access to your cache through that username and password pair. This allows you to have different security zones use different passwords and to rotate your username and password if you feel any of your credentials are compromised. As multiple credentials can be active at once, this allows you to rotate credentials with no downtime.

This is also just the first iteration. Right now one limitation is that you can’t rotate the credentials that your cache is first created with. We’ll address this shortly but want to roll it out sooner and get feedback before then. We also plan to have read-only credentials in the future and potentially namespaced credentials.

To use this feature, you’ll first need to generate extra credentials. This is all controlled through your cache’s analytics dashboard.

1. Log in to your analytics dashboard

Once there, you should now see two tabs just under the MemCachier top navigation banner. Click on the ‘Settings’ tab.

[Screenshot: the ‘Settings’ tab under the top navigation banner]

2. Create a new username + password pair

The settings page should have your core cache credentials, as well as a new table of any extra credentials you have that work for your cache. For the cache below there are two extra username + password pairs.

[Screenshot: settings page showing the core credentials and two extra username + password pairs]

Clicking the ‘Add new credentials’ button will create a new username + password pair. In the example below, this creates the new username ‘dea9b6’.

[Screenshot: the new username ‘dea9b6’ after clicking ‘Add new credentials’]

New credentials will take up to 3 minutes to synchronize and should be usable after that.

3. Removing a username + password pair

To remove a username + password pair, simply click the ‘X’ button next to that pair. The image below shows the removal of the username ‘069d1b’.

[Screenshot: removing the username ‘069d1b’ with the ‘X’ button]
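Put together, a rotation with no downtime means: create a new pair, wait for it to synchronize, switch your app over, then remove the old pair. The toy, in-memory sketch below only illustrates that ordering. The `Cache` class, `rotate`, and the `switch_app` and `wait_for_sync` hooks are hypothetical and do not correspond to any MemCachier API; in practice the create and remove steps happen in the dashboard.

```python
import secrets
from typing import Callable, Dict, Tuple


class Cache:
    """Toy model of a cache that accepts several username/password pairs."""

    def __init__(self) -> None:
        self.credentials: Dict[str, str] = {}

    def add_credentials(self) -> Tuple[str, str]:
        username, password = secrets.token_hex(3), secrets.token_hex(16)
        self.credentials[username] = password
        return username, password

    def remove_credentials(self, username: str) -> None:
        self.credentials.pop(username, None)

    def authenticate(self, username: str, password: str) -> bool:
        return self.credentials.get(username) == password


def rotate(
    cache: Cache,
    old_username: str,
    switch_app: Callable[[str, str], None],
    wait_for_sync: Callable[[], None] = lambda: None,
) -> Tuple[str, str]:
    """Rotate credentials: add new, wait, switch the app, then drop the old pair."""
    new_username, new_password = cache.add_credentials()
    wait_for_sync()  # stands in for the synchronization delay
    # Both pairs are valid at this point, so the app can switch without downtime.
    switch_app(new_username, new_password)
    cache.remove_credentials(old_username)
    return new_username, new_password
```

The old pair is removed only after the app has switched, which is what keeps the cache reachable throughout.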

Enjoy!

That’s it! Please try it out and let us know what you think and where you’d like to see this go. As always, simply email us at support@memcachier.com, comment on this blog post, or reach us on Twitter (@memcachier).

MemCachier launches on Windows Azure Store with Discount Pricing!

Today we officially launched on the Windows Azure Store! We’ve supported Windows Azure with MemCachier for a few weeks now but weren’t integrated into their store and management interface. After some hard work to upgrade our Haskell provisioning server, we now have full support for Windows Azure.

As part of this launch we’re offering great initial pricing for all customers. This pricing should be in effect until at least the end of the year, so now is a great time to take advantage of it. The exact pricing (in US dollars) is:

  • Developer (25MB) Cache – Free!
  • Developer (100MB) Cache – $10
  • Developer (250MB) Cache – $20
  • Developer (500MB) Cache – $30
  • Basic (1GB) Cache – $50
  • Basic (2.5GB) Cache – $100
  • Basic (5GB) Cache – $200
  • Basic (7.5GB) Cache – $300
  • Basic (10GB) Cache – $400
  • Basic (20GB) Cache – $800

To sign up for and manage MemCachier through Windows Azure, simply follow these steps.

1. Find the Windows Azure Store
Log on to Windows Azure and then click on the ‘Add’ button in the bottom left corner.

[Screenshot: the ‘Add’ button in the bottom left corner]

Next, you’ll want to click on the ‘Store’ option from the add menu.

[Screenshot: the ‘Store’ option in the add menu]

2. Choose MemCachier from the Store

Once in the store menu, you’ll have a long list of services to choose from. Simply scroll down to MemCachier to add the service.

[Screenshot: MemCachier in the store’s list of services]

3. Choose your MemCachier plan

Now you’ll have the choice of which MemCachier plan you want. Descriptions are given, but please refer to our plan information and service overview pages for detailed information.

Right now we have introductory pricing on Azure that offers great value!

[Screenshot: MemCachier plan selection]

Once your plan is selected, you can click next and you’ll have MemCachier up and running!

4. Setting up MemCachier

Now that MemCachier is provisioned, you’ll be provided with three configuration variables: MEMCACHIER_SERVERS, MEMCACHIER_USERNAME and MEMCACHIER_PASSWORD. These are all you need to configure your end and have access to a high-performance, fault-tolerant cache! Refer to our documentation for setting up specific languages and clients with MemCachier.

[Screenshot: connection info showing the three configuration variables]
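How you pass these values to a client depends on the language and library, but reading them usually looks much the same. A small sketch, assuming the three values are exposed to your app as environment variables and that MEMCACHIER_SERVERS may hold a comma-separated list of `host:port` entries; the function name `memcachier_config` is ours for illustration:

```python
import os
from typing import Dict, List, Mapping, Union


def memcachier_config(
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Union[str, List[str]]]:
    """Collect the three MemCachier settings; raises KeyError if one is missing."""
    # Assumption: several servers, if present, are separated by commas.
    servers = [s.strip() for s in env["MEMCACHIER_SERVERS"].split(",") if s.strip()]
    return {
        "servers": servers,
        "username": env["MEMCACHIER_USERNAME"],
        "password": env["MEMCACHIER_PASSWORD"],
    }
```

The resulting dictionary can then be handed to whichever SASL-capable memcache client you use.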

5. Manage MemCachier

The Windows Azure interface also allows you to manage MemCachier. Simply click on the ‘Add-Ons’ menu option on the left side.

[Screenshot: the ‘Add-Ons’ menu option on the left side]

Then select the MemCachier cache you want to manage.

[Screenshot: selecting the MemCachier cache to manage]

Now, in the bottom black bar for Windows Azure, you should have some options for interacting with MemCachier. These include upgrading (or downgrading) your cache, deleting it and, finally, logging into MemCachier’s own Analytics Dashboard / Management interface for the cache.

[Screenshot: MemCachier options in the bottom bar]

6. Management & Analytics Dashboard

After clicking the ‘Manage’ button, you’ll be logged into our Analytics Dashboard for your cache. Here you can observe your cache performance and behavior, flush the cache and set up New Relic integration.

[Screenshot: the Analytics Dashboard for a cache]

7. Enjoy!

Enjoy the benefits of having a managed memcache! High-performance, instant-failover, fault-tolerant and great support! Please contact us at support@memcachier.com if you have any concerns or questions.