How we live-migrated a 150-node cache

In April, we announced our plan to migrate all of our Amazon Web Services (AWS) clusters to new VPC-based infrastructure on Elastic Compute Cloud (EC2). Having wrapped that up this past month, we’ve documented the process of migrating a large, multi-tenant, distributed cache service online, along with some lessons learned.


But why Virtual Private Cloud?

Amazon’s VPCs have been the de facto standard infrastructure for EC2 instances for a while. Many newer EC2 instance types can only be launched inside a VPC, and VPCs enable some important forward-looking features like IPv6 connectivity. The aim of the migration was to move all of our clusters to the new VPC-based EC2 instances. We wanted to rationalize the instance types that we use, switching over to newer and more performant instance types, as well as reorganize all clusters to use VPCs with fine-grained security group rules.

“Trust the Process”

To explain what we did during the migration, it helps to know how MemCachier uses AWS. User caches are provisioned into a MemCachier cluster, composed of several EC2 instances, each running one or more cache backend servers and a proxy server. Clients connect to a proxy server, which manages forwarding of memcached protocol requests to the backends. The number of cache backends an instance runs depends on the instance type. Clusters vary in size from one or two instances (for development clusters) to 12 instances. We have clusters in each AWS region where we have customers (six regions at the time of writing).

Across our clusters, we have about 70 EC2 instances of different sizes, running about 150 cache backend servers. Within each cluster, user caches are sharded across all backend servers. Restarting a backend server evicts the fraction of each user cache managed by that backend. This means that this kind of large-scale migration has to be done gradually: most user applications can deal with gradual eviction of cache items far more easily than they can with a single “big bang” flush of the entire cache.
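To make the eviction behavior concrete, here is a minimal sketch of hash-based sharding (illustrative only, not our actual proxy code): each key maps to one backend, so restarting a single backend evicts only roughly 1/N of a user’s keys.

```python
import hashlib

def backend_for_key(key, backends):
    """Map a cache key to one backend via a stable hash (illustrative only)."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return backends[h % len(backends)]

backends = ["backend-1", "backend-2", "backend-3", "backend-4"]
keys = ["user:%d" % i for i in range(10000)]

# Restarting "backend-3" evicts only the keys that hash to it: roughly 1/4 here.
evicted = sum(1 for k in keys if backend_for_key(k, backends) == "backend-3")
print("%.1f%% of keys evicted" % (100.0 * evicted / len(keys)))
```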

Automation and testing

Migrating all instances across all supported AWS regions required more than 300 steps, where a step might be “launch and configure a new instance”, “start a new backend server”, or similar, with each step involving several sub-steps. Doing this manually would have been too error-prone. Instead, automating the process was key to successfully performing these steps in the right order, error-free.

Luckily, we had already automated many of the individual steps involved (e.g. launching new machines, migrating DNS records, adding/removing nodes from a cluster), but we built a new state machine-based migration tool to sequence everything and to combine individual system administration actions into larger logical steps. We tested this tool extensively on our staging environments, then used it to migrate development clusters first, to ensure that it worked correctly, before moving on to production clusters.
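To give a flavor of what that looks like, here is a highly simplified sketch of a state machine-based migration runner. The step names are hypothetical and the real tool does considerably more (error handling, parallelism across clusters, operator confirmation), but the structure is the same: an ordered list of states, each transition wrapping an already-automated admin action, with progress persisted so an interrupted migration can resume.

```python
# Hypothetical sketch of a state machine-based migration runner; the step
# names are illustrative, not MemCachier's actual tooling.
MIGRATION_STEPS = [
    "launch_new_instances",     # launch and configure new VPC-based EC2 instances
    "start_backends",           # start cache backend servers on the new instances
    "join_cluster",             # add the new backends and proxies to the cluster
    "migrate_dns",              # point DNS records at the new proxies
    "drain_old_backends",       # gradually remove old backends from the cluster
    "terminate_old_instances",  # retire the old EC2 instances
]

def run_migration(cluster, actions, last_completed=None):
    """Run (or resume) a cluster migration, executing steps strictly in order."""
    start = 0 if last_completed is None else MIGRATION_STEPS.index(last_completed) + 1
    for step in MIGRATION_STEPS[start:]:
        actions[step](cluster)       # each action is an existing automated admin task
        record_state(cluster, step)  # persist progress so a failed run can resume

def record_state(cluster, step):
    print("%s: completed %s" % (cluster, step))

if __name__ == "__main__":
    noop_actions = {step: (lambda c: None) for step in MIGRATION_STEPS}
    run_migration("cache-cluster-1", noop_actions)
```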

The result was that we had no problems during the migration that were due to human error, mis-sequencing of migration steps, or missing sub-steps within the migration plan. This was particularly important because we migrated many clusters in parallel, which made managing the overall state of the migration even more involved.

To resolve, or not to resolve?

From what we can tell, there were very few problems caused by the migration: some customers noticed and asked about the partial cache flushes that resulted from backends migrating to new machines, but there appeared to be very little disruption.

However, not everything in the garden was rosy. We had one quite serious problem during the migration. One of our larger customers noticed that they were getting inconsistent values from one of their caches, something that appeared only after we began migrating the cluster their cache was hosted on. We tracked this down to an undocumented behavior of DNS name resolution on EC2 instances (later confirmed by Amazon).

Each EC2 instance within a VPC has both a private IP address, accessible only within the VPC, and a public IP address. Intra-security group rules (i.e. “allow all traffic between instances within this security group”) work by matching the private IP address subnet, so servers within a MemCachier cluster use private IP addresses to talk to one another. An instance’s public DNS record resolves to its private IP address within a VPC since EC2 provides a DNS resolver specific to the VPC. This means that one instance within a VPC can connect to another using its DNS name and intra-security group rules will apply correctly!

The problem we discovered was that the mapping from a new instance’s public DNS name to its private IP address can take up to several minutes to propagate after the instance is launched. During that period, the new instance’s DNS name resolves to its public IP address from other instances in the VPC. Of course, trying to connect to the new instance via its public IP address wouldn’t work, because our intra-security group rules restrict access to the VPC’s private IP subnet.

As a result, when cache backend servers in the cluster were notified of a new instance and tried to connect to its proxy server, the new instance’s DNS name sometimes resolved to the public IP address and the connection failed. This meant that different proxy processes within the cluster ended up being connected to different numbers of cache backends, which led to inconsistent views of some caches.

Once we had identified the problem, avoiding it for the rest of the migration was easy, since we have a mechanism to force all backends in a cluster to refresh their proxy connections, but it was a big problem for this one customer in particular. We have now also worked around this in our code, so we won’t be hit by this again.
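The workaround amounts to not trusting the first DNS answer. Our actual fix differs in detail, but a minimal sketch of the idea is to resolve the new instance’s DNS name and retry until it yields an address inside the VPC’s private range before attempting to connect:

```python
import ipaddress
import socket
import time

def resolve_private_ip(hostname, retries=30, delay=10.0):
    """Resolve an EC2 public DNS name, retrying until it returns a private
    (RFC 1918) address, i.e. until the VPC-local record has propagated."""
    for _ in range(retries):
        addr = socket.gethostbyname(hostname)
        if ipaddress.ip_address(addr).is_private:
            return addr
        time.sleep(delay)  # got the public IP: record not yet propagated
    raise RuntimeError("%s never resolved to a private address" % hostname)
```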

Lessons learned

There were three main lessons we drew from all this:

  1. Scheduling: We did the migration over the space of a couple of weeks that spanned the end of the month, which was a mistake. Some customers had end-of-month processes to run and it would have been better to avoid the few days either side of the end of the month to ensure that there was no migration-related performance degradation for them.
  2. Announcements: We announced the migration in a blog post and on Twitter, and we contacted all of our larger customers by email, but those emails could have been more explicit about asking customers to let us know about any special jobs they needed to run during the migration period.
  3. Extra cluster diagnostics: The biggest problem we had (the DNS name resolution issue) could have been detected and alerted on with better cluster diagnostics. We’ve had an issue open for a while to add some of those things, and it will definitely be done before we do any more large infrastructure changes.


MemCachier Launches on Manifold

We’ve recently launched MemCachier on Manifold! Manifold is a new company that aims to bring the developer app store to everyone, rather than tying it to a single PaaS provider.

We are really excited by this partnership with Manifold and the mission they are on. Please check them out and give MemCachier a spin through them.

AWS Infrastructure Migration

The AWS infrastructure that MemCachier uses for direct sign-up and Heroku customers evolves over time. Amazon releases new EC2 instance types and new network infrastructure features on a regular basis. In order to exploit these new features, MemCachier occasionally needs to migrate caches to new infrastructure. We are planning to migrate all of our clusters to new AWS infrastructure over the course of the next few weeks, starting with development caches and production caches in less-used AWS regions next week. There is likely to be some limited impact on cache performance during the migration process, particularly as we retire existing EC2 instances in favor of new ones.

This migration should generally improve the performance of our cache backends by switching to a newer generation of EC2 instance types, and it will also allow us to lock down security within our infrastructure in a more fine-grained way by switching all of our clusters over to AWS VPC networking.

Please direct any questions about the migration process to our support team. We will be reaching out individually to customers with larger caches and more exacting utilization patterns in the next few days.

Revamped Status Page

We are happy to announce some wide-ranging improvements to our status page. While it looks the same on the surface, it has been rewritten from scratch and is now much more tightly integrated with our monitoring infrastructure. For our customers, this has three main consequences.

Fully automated status updates

Status changes are now fully automated. Over the last few months we have carefully tuned the sensitivity of the status page to the point where we are now happy with its accuracy and responsiveness. While we continue to improve the algorithms that monitor our infrastructure, the status page reflects the health of our infrastructure better than ever.

Better differentiated status

We now differentiate between increased latency and unreachability in the status of the proxies in our clusters. For convenience we use a simple traffic-light approach: green means all proxies in a cluster are up, running, and responding quickly enough; yellow means one or more proxies in a cluster are responding more slowly than usual; and red means that our monitoring infrastructure is unable to reach a proxy at all.
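As a rough sketch (with a placeholder latency threshold; the real thresholds and checks live in our monitoring system), the classification looks something like this:

```python
def proxy_status(reachable, latency_ms, threshold_ms=50.0):
    """Classify one proxy check as a traffic-light status (illustrative threshold)."""
    if not reachable:
        return "red"     # monitoring cannot reach the proxy at all
    if latency_ms > threshold_ms:
        return "yellow"  # responding, but more slowly than usual
    return "green"       # up and responding quickly enough

def cluster_status(checks):
    """A cluster's status is that of its worst proxy."""
    severity = {"green": 0, "yellow": 1, "red": 2}
    return max((proxy_status(*check) for check in checks), key=severity.get)

print(cluster_status([(True, 12.0), (True, 80.0)]))  # "yellow"
```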

“Recently Resolved”

Previous support incidents have revealed that customers often visit our status page after an incident has already been resolved.  They are then puzzled because all our clusters show green statuses when they have clearly just had trouble connecting to our servers. To avoid this confusion, we now flag such clusters as “Recently Resolved” and highlight them with the color blue.

All these changes are also reflected in the status RSS feed. Please let us know if you think these changes are useful or what other features you would like to see next. We have an ongoing program of improvements in system monitoring and fault diagnosis, but we’re always very happy to hear ideas from customers that would make their experience using MemCachier better!

New Security Features

We have recently added some additional features to MemCachier. In brief, you can now have multiple sets of credentials for a cache, can rotate credentials, and can restrict some capabilities on a per-credential basis. These features are all controlled from the Credentials panel of the MemCachier dashboard for your cache.


The features described in this article are also explained in the MemCachier documentation.


Caches may now have multiple sets of username/password credentials. The credentials for a cache are listed in the table in the Credentials panel on your cache’s analytics dashboard. New credentials for a cache are created using the Add new credentials button, and credentials may be deleted using the X button next to the credentials row.

One set of credentials is distinguished as primary, while the rest are secondary credentials. For caches provided through our add-on with a third-party provider (e.g., Heroku or AppHarbor), the primary credentials are linked to the configuration variables stored with the provider. That is, these are the credentials you see when you look at your application’s environment.  Secondary credentials can be promoted to primary using the up-arrow button next to the credentials row. When secondary credentials are promoted to primary, the username and password for the credentials being promoted are pushed to the provider’s configuration variables.

In practice, this means that you can rotate the credentials for a MemCachier cache associated with a Heroku app by creating a new set of credentials and promoting them to primary. This will push the new username and password to the MEMCACHIER_USERNAME and MEMCACHIER_PASSWORD configuration variables in your Heroku app, and trigger a restart so your application will pick up the new MemCachier credentials.

This ability to have both primary and multiple secondary credentials lets you rotate your credentials with zero downtime!
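In practice, an application that reads its MemCachier credentials from the environment picks up the promoted credentials automatically when it restarts. Here is a minimal sketch using the python-binary-memcached (bmemcached) client; any SASL-capable memcached client works the same way:

```python
import os
import bmemcached  # python-binary-memcached

# These configuration variables are updated automatically when secondary
# credentials are promoted to primary, so a restart picks up the new values.
client = bmemcached.Client(
    os.environ["MEMCACHIER_SERVERS"].split(","),
    os.environ["MEMCACHIER_USERNAME"],
    os.environ["MEMCACHIER_PASSWORD"],
)

client.set("greeting", "hello")
print(client.get("greeting"))
```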


We’ve further enhanced multiple credentials by adding a security extension to our memcache implementation: capabilities! These control the operations available to a client once they authenticate with the chosen credentials.

Each set of credentials for a cache has associated write and flush capabilities. By default, new credentials have both capabilities, meaning that the credentials can be used to update cache entries and flush the cache via the memcached API.

You can restrict a client to read-only access by switching off the write capability for its credentials, and prevent a client from using the memcached API to flush your cache by switching off the flush capability.

Per-credential capabilities are managed using the checkboxes on each credential row in the Credentials panel: if the checkbox is checked, the credential has the capability; if not, the capability is disabled for that credential.
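For example, a client authenticated with credentials whose write capability is switched off can read but not modify the cache. A sketch of what that looks like, with placeholder hostname and credentials (the exact failure mode for a rejected write depends on the client library):

```python
import bmemcached

# Placeholder server and credentials: substitute read-only credentials created
# in the dashboard (write and flush capabilities switched off).
readonly = bmemcached.Client(["mc1.example.memcachier.com:11211"],
                             "readonly-user", "readonly-pass")

print(readonly.get("greeting"))  # reads are permitted

# The server rejects writes and flushes for these credentials; depending on
# the client library this surfaces as a failed return value or an exception.
try:
    ok = readonly.set("greeting", "tampered")
    print("write accepted?", ok)
except Exception as exc:
    print("write rejected:", exc)
```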

Dashboard SSO rotation

The MemCachier dashboard for your cache is accessed via a persistent unique URL containing a cache-specific single-sign-on hash. For security, you may wish to rotate this cache-specific hash to generate a new unique URL for the cache dashboard. The Rotate SSO secret button in the Credentials panel of the dashboard does this hash rotation. Pressing this button generates a new hash and redirects the dashboard to the resulting new unique URL.

Security @ MemCachier

We value security and hope these new features help you improve your own security practices. As always, for a guide to MemCachier’s own internal security practices, see our documentation.