Announcement: cluster merging

As part of the infrastructure improvements that we started in June, we’re going to be merging some of the clusters of EC2 instances that host MemCachier caches, starting in the next couple of weeks.  Not every customer cache will be affected, and the impact during the merge process will be minimal, but we want to be up-front about the fact that this is happening.

The main purpose of this activity is to take advantage of some of the newer instance types offered by AWS, and in particular to improve the networking performance of the instances we use.  Merging clusters also gives us the opportunity to launch new sets of instances for the new merged clusters, which means we can set these instances up in placement groups, which provide additional benefits in terms of network latency and throughput.

The only negative impact that customers may notice during the cluster merge process is that small fractions of their caches’ keyspace will be evicted as storage backends are migrated into the new merged cluster.  We’ll follow the same kind of procedure for this that we’ve used for earlier migrations, where we move backends over the course of a couple of weeks.  This has been successful in the past and hasn’t led to any noticeable performance degradation for users.

The positive impact of these changes is that all user caches in the merged clusters should exhibit lower latency once the merge is complete!

If you have any questions about this process, please contact us as usual.

AWS Infrastructure Migration: Results

In this article, we’ll describe some of the performance impacts of the AWS infrastructure migration we performed recently.

State before and after migration

Before the migration, MemCachier clusters mostly used EC2-Classic instances, i.e. instances not in a VPC. The monitor machines (used for measuring cluster latencies) had recently been rebuilt and were already in VPCs.

After the migration, all cluster machines are in VPCs, with one subnet per availability zone, and one security group per cluster. The monitor machines are still in the same VPCs as before. For regions with clusters in multiple AZs, this means that some latency measurements are intra-AZ measurements and some are cross-AZ measurements.

Measuring latencies

We continually (every 10 seconds) measure get, set and proxy latencies to a test cache in each cluster. The latency measurements are done using normal memcached protocol requests from the monitor machine, which is in a different VPC to the cluster EC2 instances. The idea here is to represent a reasonably realistic setup in terms of the distance from the client (monitor) machine to the cluster, replicating what a real customer client would see.
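
As a rough sketch of what such a monitoring probe looks like, consider the following (using the pymemcache client and a hypothetical cache endpoint; the real monitor also authenticates with cache credentials and records results rather than printing them, and the use of a version request to approximate proxy latency is an assumption about how such a measurement could be made):

```python
import time
from pymemcache.client.base import Client

# Hypothetical test cache endpoint; credentials and error handling omitted.
client = Client(("test-cache.example.com", 11211), connect_timeout=2, timeout=2)

def timed(op):
    """Run op() and return its wall-clock latency in microseconds."""
    t0 = time.perf_counter()
    op()
    return (time.perf_counter() - t0) * 1e6

while True:
    proxy_us = timed(client.version)                            # round trip to the proxy only
    set_us = timed(lambda: client.set("monitor:probe", b"x"))   # proxy + backend write
    get_us = timed(lambda: client.get("monitor:probe"))         # proxy + backend read
    print(f"proxy={proxy_us:.0f}us set={set_us:.0f}us get={get_us:.0f}us")
    time.sleep(10)  # one sample every 10 seconds
```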

Get, set and proxy latency values are measured independently at each sampling time. (This will be relevant later, when we want to try to remove the latency to the proxy from the get and set values in order to get some idea of the intra-cluster latency.)

The raw latency data usually looks something like this (from a machine chosen at random):


There’s some natural variability in the raw latency values, and as is usual with these kinds of network propagation time measurements, there is a long tail towards larger latency values (the axis scale here is logarithmic, so the spikes in the latency graph are really quite significant).

We use daily latency averages to visualize longer-term trends. The last few months of data from a machine in the US-East-1 region look like this:


Although averaging hides important features of the latency distribution, there are some interesting long-term secular trends in the averages. For US-East-1, there are four distinct periods visible since the beginning of 2017. We did the VPC migration for US-East-1 at the beginning of June, which explains the change there. The period from mid-February until early March with higher average latencies was a period where we believe the monitor machine for US-East-1 was suffering from a “noisy neighbor” (where another EC2 instance on the same physical server as our monitor instance exhibits intermittently high loads, either on CPU or network communications, interfering with the performance of the monitor process). In early March, we launched a new monitor instance (with a larger instance type), which made this problem go away.  The period between launching this new monitor instance and when we performed the VPC migration has anomalously good latency values, which we think is just down to “luck” in the placement of the new monitor instance in relation to the EC2-Classic instances that were hosting our servers at that time.

Average latencies from a machine in the EU-West-1 region show much less long-term variation, although there are still some secular changes that aren’t associated with anything that we did:


Empirical latency CDFs

Comparing time series plots of averaged latencies shows some long-term trends, but it hides the interesting and important part of the distribution of latency values. It’s an unfortunate fact of life that the most important part of the distribution of latency values is the hardest to examine. These important values are the rarer large latency values lying in the upper quantiles of the latency distribution. To understand why these are important, consider an application that, in order to service a typical user request, needs to read 50 items from its MemCachier cache (to make this concrete, think of a service like Twitter that caches pre-rendered posts and needs to read a selection of posts to render a user’s timeline).  Let’s think about the 99.9% quantile of the latency distribution, i.e. the latency value for which 999/1000 requests get a quicker response. Exceeding this seems like it should be a pretty rare event, but if we make 50 requests in a row, the probability of getting a response quicker than the 99.9% quantile for all of them is (0.999)^50 ≈ 0.951. This means that about 5% of such sequences of requests will experience at least one response as bad as the 99.9% quantile, and this slowest cache response will delay the final rendering of the response to the user.
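
The arithmetic is easy to check:

```python
p_fast = 0.999      # probability one request beats the 99.9% quantile
n = 50              # cache reads needed to render one user response

p_all_fast = p_fast ** n
print(p_all_fast)        # ≈ 0.951
print(1 - p_all_fast)    # ≈ 0.049, i.e. ~5% of renders see at least one slow response
```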

One way of visualizing the whole distribution of latency values is using the empirical cumulative distribution function (CDF). The CDF of a distribution is the function F(x) = P(X ≤ x) that gives the probability that the value of a random variable X sampled from the distribution is no larger than a given value x. To calculate an empirical estimate of the CDF of a distribution from a finite sample, we just count the fraction of data values less than or equal to each given value. As usual with this sort of empirical distribution estimation, what you get out is just an approximation to the true CDF: if you have enough data, it’s pretty good in the “middle” of the distribution, but you suffer a bit from sampling error in the tails of the distribution (just because there are naturally fewer data samples in the tails).
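
As a small sketch (with made-up data standing in for real measurements), the empirical CDF is just the sorted sample paired with cumulative fractions:

```python
import numpy as np

# Made-up latency samples in microseconds.
latencies = np.random.lognormal(mean=6.0, sigma=0.5, size=10_000)

xs = np.sort(latencies)                      # sorted sample values
cdf = np.arange(1, len(xs) + 1) / len(xs)    # fraction of data <= each value

# Reading a high quantile off the empirical distribution:
print("99.9% quantile ≈", np.quantile(latencies, 0.999), "µs")
```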

The following figures show the empirical CDFs for the get, set and proxy latency for a couple of machines, one in a cluster in US-East-1 and one in EU-West-1. Colours distinguish CDFs for three different 3-day time periods (one each immediately before and after the infrastructure migration and one earlier on, to show the “normal” situation for the US-East-1 region), and line thicknesses distinguish between get, set and proxy values. Latencies are measured in microseconds and are plotted on a logarithmic scale. (The scale is cut off at the right-hand end, excluding the very largest values, since this makes it easier to see what’s going on.)


These two plots are pretty indicative of what’s seen for all of our machines. There’s a split between US-East-1 and the other regions, with latencies in the other regions being consistently lower after the migration than before: in the right-hand plot above, each of the red (after migration) “get”, “set” and “proxy” curves is to the left (i.e. at lower latency values) of the corresponding blue (before migration) curve. The higher quantiles for these regions are consistently lower after the migration than before.

US-East-1 is a little different. Overall, performance as measured by latency values is comparable after the migration to what we saw in February, but the CDFs from after the migration have an extra kink in them that we can’t explain, as seen in the red curves on the left-hand plot above. We also have this period through most of February where we had very low latency values, almost certainly down to some lucky eventuality of placement of the new monitor machine that we launched. It’s interesting to see that the lowest latency values after the migration match up well with the lowest values seen during the “good” period, but the larger values correspond more closely with what we saw in February. It’s our belief that this is an artefact of the underlying AWS networking infrastructure. We’ll continue to keep an eye on this and might try starting a new monitoring instance to see if that leads to better results (here, “better” mostly means “more understandable”).

The get and set latency values shown in the CDFs above include both the time taken for a memcached protocol message to get to and from the proxy server that the monitor connects to and the time taken for communication between the proxy server and the involved cache backend server. Measuring intra-cluster latencies, i.e. that communication time between proxy and cache backend, is complicated by the independent sampling of set, get and proxy latency at each sample time. (There is obviously some correlation between the different quantities, since the long-term variability seen above in the average latency values appears to occur synchronously in each quantity, but this is swamped by stochastic variability at short time-scales.)

One thing we can do though is to look at differences in the CDFs. We basically just take the difference between (for example) the get latency and the proxy latency for each quantile value and call this the “intra-cluster get latency CDF”. The figure below shows these “difference CDFs” for get and set latency for the same two machines shown in the CDF composite plot above.
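
Roughly speaking (and with made-up data rather than our monitoring series), the computation is just a per-quantile subtraction:

```python
import numpy as np

# Made-up, independently sampled latency series in microseconds.
get_latencies = np.random.lognormal(6.2, 0.5, size=10_000)
proxy_latencies = np.random.lognormal(6.0, 0.4, size=10_000)

qs = np.linspace(0.01, 0.999, 200)

# "Intra-cluster get latency": the difference of the two quantile functions,
# not a distribution of per-request differences (those aren't paired).
intra_cluster_get = np.quantile(get_latencies, qs) - np.quantile(proxy_latencies, qs)
```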


The EU-West-1 results are comparable to what we see for all machines outside US-East-1: intra-cluster get and set latencies are reduced after the migration compared to before. The situation in US-East-1 is more complicated, but the plot here is representative of what we see: intra-cluster get latencies are better than or the same as during the “good” period, while intra-cluster set latencies are less good, but comparable to what we saw in February. This is slightly different to what we saw in the composite CDF plots, where get, set and proxy latencies were all worse after the migration than during the “good” period. We speculate that this is just down to differences in the route taken by memcached requests from the monitor machine to the proxies in US-East-1 now that the proxies are all on machines in a VPC (as opposed to being on EC2-Classic instances). We’re still working on understanding exactly what’s going on here, and are continuing to monitor the latency behaviour in US-East-1.


Conclusions

The latency measurements that we use to monitor MemCachier performance are very susceptible to small variations in network infrastructure. This makes them a sensitive measuring tool (indeed, we use alerting rules based on unexpected latency excursions to catch and fix networking problems early), but it can also make them difficult to understand.

It seems that the most significant factor in latency behaviour really is “luck” related to infrastructure placement. This can be seen both from the existence of this “good” period in US-East-1 for which we have no other reasonable explanation, and from the long-term secular changes in latencies that we presume must be associated with changes within the AWS infrastructure (either networking changes, or the addition or removal of instances on the physical machines that our instances are hosted on). This unpredictability is the price you pay for using an IaaS service like AWS.

From the perspective of the infrastructure migration, we are quite satisfied with the performance after migration. All regions except for US-East-1 are clearly better than before, and while the results in US-East-1 are a little confusing, performance is better than it was in February, and from some perspectives is comparable even to the “lucky good period” that we had.

How we live migrated a 150 node cache

In April, we announced our plan to migrate all of our Amazon Web Services (AWS) clusters to new VPC-based infrastructure on Elastic Compute Cloud (EC2). As we wrapped that up this past month, we documented the process of migrating a large multi-tenant, distributed cache service live, as well as some of the lessons we learned.


But why Virtual Private Cloud?

Amazon’s VPCs have been the de facto infrastructure for EC2 instances for a while. Many newer EC2 instance types can only be launched inside a VPC, and VPCs enable some important forward-looking features like IPv6 connectivity. The aim of the migration was to move all of our clusters to new VPC-based EC2 instances. We wanted to rationalize the instance types that we use, switching over to newer and more performant instance types, as well as reorganize all clusters to use VPCs with fine-grained security group rules.

“Trust the Process”

To explain what we did during the migration, it helps to know how MemCachier uses AWS. User caches are provisioned into a MemCachier cluster, composed of several EC2 instances, each running one or more cache backend servers and a proxy server. Clients connect to a proxy server, which manages forwarding of memcached protocol requests to backends. The number of cache backends an instance runs depends on the instance type. Clusters vary in size from one or two instances (for development clusters) to 12 instances. We have clusters in each AWS region where we have customers (six regions at the time of writing).

Across our clusters, we have about 70 EC2 instances of different sizes, running about 150 cache backend servers. Within each cluster, user caches are sharded across all backend servers. Restarting a backend server evicts the fraction of each user cache managed by that backend. This means that this kind of large-scale migration has to be done gradually: most user applications can deal with gradual eviction of cache items far more easily than they can with a single “big bang” flush of the entire cache.
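
MemCachier’s actual sharding scheme isn’t described here, but as a toy illustration of why restarting one backend only costs a slice of each cache, imagine simple hash-based shard selection:

```python
import hashlib

def backend_for(key: str, backends: list[str]) -> str:
    """Toy shard selection: hash the key and pick one backend. Restarting
    one backend then only evicts the keys that happen to hash to it."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return backends[h % len(backends)]

backends = ["backend-1", "backend-2", "backend-3"]
print(backend_for("user:42:timeline", backends))
```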

Automation and testing

Migrating all instances across all supported AWS regions required more than 300 steps, where a step might be “launch and configure a new instance”, “start a new backend server”, or similar, and each step involved several sub-steps. Doing this manually would have been too error-prone. Instead, automating the process was key to successfully performing these steps in the right order, error-free.

Luckily, we had already automated many of the individual steps involved (e.g. launching new machines, migrating DNS records, adding/removing nodes from a cluster), but we built a new state machine-based migration tool to sequence everything and to combine individual system administration actions into larger logical steps. We tested this tool extensively on our staging environments, then used it to migrate development clusters first, to ensure that it worked correctly, before moving on to production clusters.
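
As an illustration only (the real tool tracks much more state, runs per cluster, and wraps our existing automation), the core of such a sequencer is very simple:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Migration:
    """Drive a migration through an ordered list of named steps, recording
    progress so a failed run can resume instead of starting over."""
    steps: list[tuple[str, Callable[[], None]]]
    completed: list[str] = field(default_factory=list)

    def run(self) -> None:
        for name, action in self.steps:
            if name in self.completed:
                continue                    # already done on a previous run
            print(f"-> {name}")
            action()                        # launch instance, update DNS, ...
            self.completed.append(name)

# Hypothetical step list; real actions would call into our admin tooling.
Migration(steps=[
    ("launch and configure new instance", lambda: None),
    ("start new backend server", lambda: None),
    ("migrate DNS records", lambda: None),
    ("retire old instance", lambda: None),
]).run()
```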

The result was that we had no problems during the migration that were due to human error, mis-sequencing of migration steps, or missing sub-steps within the migration plan. This was particularly important because we migrated many clusters in parallel, which made managing the overall state of the migration even more involved.

To resolve, or not to resolve?

From what we can tell, there were very few problems caused by the migration: some customers noticed and asked about the partial cache flushes that resulted from backends migrating to new machines, but there appeared to be very little disruption.

However, not everything in the garden was rosy. We had one quite serious problem during the migration. One of our larger customers noticed that they were getting inconsistent values from one of their caches, something that appeared only after we began migrating the cluster their cache is hosted on. We tracked this down to an undocumented behavior of DNS name resolution on EC2 instances (later confirmed by Amazon).

Each EC2 instance within a VPC has both a private IP address, accessible only within the VPC, and a public IP address. Intra-security group rules (i.e. “allow all traffic between instances within this security group”) work by matching the private IP address subnet, so servers within a MemCachier cluster use private IP addresses to talk to one another. An instance’s public DNS record resolves to its private IP address within a VPC since EC2 provides a DNS resolver specific to the VPC. This means that one instance within a VPC can connect to another using its DNS name and intra-security group rules will apply correctly!

The problem we discovered was that the mapping from a new instance’s public DNS name to its private IP address can take several minutes to propagate after the instance is launched. During that period, the new instance’s DNS name resolves to its public IP address from other instances in the VPC. Of course, trying to connect to the new instance via its public IP address wouldn’t work, because our intra-security group rules restrict access to the VPC’s private IP subnet.

As a result, when cache backend servers in the cluster were notified of a new instance and tried to connect to its proxy server, the new instance’s DNS name sometimes resolved to the public IP address and the connection failed. This meant that different proxy processes within the cluster ended up being connected to different numbers of cache backends, which led to inconsistent views of some caches.

Once we had identified the problem, avoiding it for the rest of the migration was easy, since we have a mechanism to force all backends in a cluster to refresh their proxy connections, but it was a big problem for this one customer in particular. We have now also worked around this in our code, so we won’t be hit by this again.
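
One way to defend against this (not necessarily exactly what our fix does; the subnet and hostname below are hypothetical) is simply not to trust the first answer: keep re-resolving until the name maps to an address inside the VPC’s private range before connecting.

```python
import ipaddress
import socket
import time

PRIVATE_SUBNET = ipaddress.ip_network("10.0.0.0/16")   # hypothetical VPC CIDR

def wait_for_private_ip(hostname: str, timeout: float = 300.0) -> str:
    """Re-resolve hostname until the VPC resolver returns a private address,
    so that intra-security-group rules will apply to the connection."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        addr = socket.gethostbyname(hostname)
        if ipaddress.ip_address(addr) in PRIVATE_SUBNET:
            return addr
        time.sleep(5)   # still the public IP: DNS mapping not propagated yet
    raise TimeoutError(f"{hostname} never resolved to a private address")
```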

Lessons learned

There were three main lessons we drew from all this:

  1. Scheduling: We did the migration over the space of a couple of weeks that spanned the end of the month, which was a mistake. Some customers had end-of-month processes to run and it would have been better to avoid the few days either side of the end of the month to ensure that there was no migration-related performance degradation for them.
  2. Announcements: We announced the migration in a blog post and on Twitter, and we contacted all of our larger customers by email, but those emails could have been more explicit about asking customers to let us know about any special jobs they needed to run during the migration period.
  3. Extra cluster diagnostics: The biggest problem we had (the DNS name resolution issue) could have been detected and alerted with better cluster diagnostics. We’ve had an issue open for a while to add some of those things, and it will definitely be done before we do any more large infrastructure changes.


AWS Infrastructure Migration

The AWS infrastructure that MemCachier uses for direct sign-up and Heroku customers evolves over time. Amazon releases new EC2 instance types and new network infrastructure features on a regular basis. In order to exploit these new features, MemCachier occasionally needs to migrate caches to new infrastructure. We are planning to migrate all of our clusters to new AWS infrastructure over the course of the next few weeks, starting with development caches and production caches in less-used AWS regions next week. There is likely to be some limited impact on cache performance during the migration process, particularly as we retire existing EC2 instances in favor of new ones.

This migration should generally improve the performance of our cache backends by switching to a newer generation of EC2 instance types, and it will also allow us to lock down security within our infrastructure in a more fine-grained way by switching all of our clusters over to AWS VPC networking.

Please direct any questions about the migration process to us. We will be reaching out individually to customers with larger caches and more exacting utilization patterns in the next few days.

New Security Features

We have recently added some additional features to MemCachier. In brief, you can now have multiple sets of credentials for a cache, can rotate credentials, and can restrict some capabilities on a per-credential basis. These features are all controlled from the Credentials panel of the MemCachier dashboard for your cache:


The features described in this article are also explained in the MemCachier documentation.


Multiple credentials

Caches may now have multiple sets of username/password credentials. The credentials for a cache are listed in the table in the Credentials panel on your cache’s analytics dashboard. New credentials for a cache are created using the Add new credentials button, and credentials may be deleted using the X button next to the credentials row.

One set of credentials is distinguished as primary, while the rest are secondary credentials. For caches provided through our add-on with a third-party provider (e.g., Heroku or AppHarbor), the primary credentials are linked to the configuration variables stored with the provider. That is, these are the credentials you see when you look at your application’s environment.  Secondary credentials can be promoted to primary using the up-arrow button next to the credentials row. When secondary credentials are promoted to primary, the username and password for the credentials being promoted are pushed to the provider’s configuration variables.

In practice, this means that you can rotate the credentials for a MemCachier cache associated with a Heroku app by creating a new set of credentials and promoting them to primary. This will push the new username and password to the MEMCACHIER_USERNAME and MEMCACHIER_PASSWORD configuration variables in your Heroku app, and trigger a restart so your application will pick up the new MemCachier credentials.
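
From the application’s side nothing special is needed, as long as the credentials are read from the environment at startup. A minimal sketch using the bmemcached Python client (MEMCACHIER_SERVERS is the companion configuration variable the add-on provides alongside the username and password):

```python
import os
import bmemcached  # python-binary-memcached

# The promotion rewrites MEMCACHIER_USERNAME / MEMCACHIER_PASSWORD and
# restarts the app, so the new credentials are picked up on the next boot.
client = bmemcached.Client(
    os.environ["MEMCACHIER_SERVERS"].split(","),
    os.environ["MEMCACHIER_USERNAME"],
    os.environ["MEMCACHIER_PASSWORD"],
)

client.set("greeting", "hello")
print(client.get("greeting"))
```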

This ability to have a primary and many secondary sets of credentials lets you rotate your credentials with zero downtime!


Credential capabilities

We’ve further enhanced multiple credentials by adding a security extension to our memcache implementation: capabilities! These control the operations available to a client once it has authenticated with a given set of credentials.

Each set of credentials for a cache has associated write and flush capabilities. By default, new credentials have both capabilities, meaning that the credentials can be used to update cache entries and flush the cache via the memcached API.

You can restrict a client to read-only access by switching off the write capability for its credentials, and prevent a client from using the memcached API to flush your cache by switching off the flush capability.
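
For example, a client authenticated with credentials that have the write capability switched off can still read, but attempts to modify the cache will be refused (a sketch using bmemcached with hypothetical read-only credentials; exactly how the refusal surfaces depends on the client library):

```python
import bmemcached

# Hypothetical credentials whose write (and flush) capabilities are disabled.
readonly = bmemcached.Client(["mc1.example.memcachier.com:11211"],
                             "readonly-user", "readonly-pass")

print(readonly.get("greeting"))          # reads work as normal

try:
    readonly.set("greeting", "bye")      # refused: no write capability
except Exception as exc:
    print("write refused:", exc)
```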

Per-credential capabilities are managed using the checkboxes on each credential row in the Credentials panel: if the checkbox is checked, the credential has the capability; if not, the capability is disabled for that credential.

Dashboard SSO rotation

The MemCachier dashboard for your cache is accessed via a persistent unique URL containing a cache-specific single-sign-on hash. For security, you may wish to rotate this cache-specific hash to generate a new unique URL for the cache dashboard. The Rotate SSO secret button in the Credentials panel of the dashboard does this hash rotation. Pressing this button generates a new hash and redirects the dashboard to the resulting new unique URL.

Security @ MemCachier

We value security and hope that you find these new features valuable to improve your own practices. As always, for a guide to MemCachier’s own internal security practices, see our documentation.