MagLev, our new consistent hashing scheme

A key feature that MemCachier offers its clients is the ability to use multiple proxy servers to access their cache. This is important because most issues that clients experience are related to network delays or general connectivity. In such cases a memcached client can fall back to a different proxy and still have access to the entire cache.

To implement this feature MemCachier uses consistent hashing. Just like regular hashing, this assigns every key to a specific back-end, but it has a crucial advantage: if the number of back-ends changes, only very few keys need to be reassigned. For example, if the number of back-ends increases from 3 to 4, a naive hashing scheme would assign 75% of the keys to a different back-end after the change. In such a case, a good consistent hashing scheme only changes the assignments of 25% of the keys.

What’s wrong with our current scheme?

Until now we have used the most basic consistent hashing scheme, which is based on a ring. Each back-end is assigned a point on the ring (based on the hash of an ID that identifies the back-end) and is in charge of all keys that hash to points from there up to the point that belongs to the next back-end. As an example, consider a ring with 100 points and 3 back-ends. The ID of each back-end is hashed to a point on the ring, i.e. a point in the interval [1, 100]. Ideally we’d like the back-ends to be equally spaced around the ring, e.g., at 33, 66, and 99, which would assign all keys that map to points between 33 and 65 to the back-end at 33, and so on. In practice, however, the size of the segment each back-end is responsible for can vary quite a lot, which results in skew: if the 3 back-end IDs hashed to 12, 17, and 51, for example, the assignment of keys to back-ends would be quite uneven.
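To make this concrete, here is a minimal sketch of such a ring in Python; the hash function, class and back-end names are purely illustrative, not our production code:

```python
import hashlib
from bisect import bisect_right

def ring_hash(value: str) -> int:
    """Hash a string to a point on a 2**64-point ring."""
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[:8], "big")

class Ring:
    def __init__(self, backend_ids, points_per_backend=1):
        # Place each back-end at one or more pseudo-random points on the ring.
        placements = sorted(
            (ring_hash(f"{backend}:{i}"), backend)
            for backend in backend_ids
            for i in range(points_per_backend)
        )
        self.points = [p for p, _ in placements]
        self.owners = [b for _, b in placements]

    def backend_for(self, key: str) -> str:
        # A key belongs to the back-end whose point precedes (or equals) the
        # key's hash; keys before the first point wrap around to the last one.
        idx = bisect_right(self.points, ring_hash(key)) - 1
        return self.owners[idx]

ring = Ring(["backend-1", "backend-2", "backend-3"], points_per_backend=100)
print(ring.backend_for("user:42"))
```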

When we simulate an example architecture with 9 back-ends and a realistically sized ring (2^64 points) over 100 rounds, each time ordering the back-ends by their share of the ring, the distribution of the percentage of the ring each back-end gets looks like this:

figure_11

While one back-end gets around 30% of all keys, others are assigned hardly any keys. This effect is generally dampened by assigning each back-end to multiple points on the ring in order to statistically even out the skew. With 100 points per back-end the situation looks like this:

figure_1001

This looks much better, but if we zoom in and look at the skew in more detail, we see how bad it still is:

figure_1001_zoom

The median (the red center line in the boxes) of the most loaded back-end is 30% above that of the least loaded back-end. It is not unlikely (within the 95th percentile interval, the whiskers around the boxes) for the most loaded back-end to get 14.4% of the keys while the least loaded one gets only 8.3%. That is nearly a factor of two of skew between the most and least loaded back-end!
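The simulation behind figures like these can be sketched roughly as follows. This is an illustrative reconstruction of the methodology (9 back-ends, 100 rounds, shares of a 2^64-point ring), not the code that produced the plots:

```python
import hashlib
import statistics

RING_SIZE = 2**64

def point(value: str) -> int:
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE

def shares(n_backends: int, points_per_backend: int, round_id: int):
    """Fraction of the ring owned by each back-end in one round, sorted ascending."""
    placements = sorted(
        (point(f"round{round_id}:backend{b}:{i}"), b)
        for b in range(n_backends)
        for i in range(points_per_backend)
    )
    owned = [0] * n_backends
    # Each placement owns the segment up to the next placement; the last one wraps around.
    sentinel = [(placements[0][0] + RING_SIZE, None)]
    for (p, b), (next_p, _) in zip(placements, placements[1:] + sentinel):
        owned[b] += next_p - p
    return sorted(o / RING_SIZE for o in owned)

# 100 rounds, 9 back-ends, 100 points each; compare the least and most loaded back-end.
rounds = [shares(9, 100, r) for r in range(100)]
print("least loaded back-end, median share:", statistics.median(r[0] for r in rounds))
print("most loaded back-end, median share: ", statistics.median(r[-1] for r in rounds))
```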

Enter MagLev

MagLev is a consistent hashing algorithm developed at Google and published in 2016. Its main advantage is that it has essentially no skew at all. It achieves this with a deterministic algorithm that builds a lookup table for keys in a round-robin fashion, guaranteeing that each back-end gets the same number of keys. What about the remapping of keys when a back-end changes? In theory, it is basically the same as for ring-based consistent hashing, that is, 1/(# of back-ends), or 25% when going from 3 to 4 back-ends as in the example above. In practice it is more predictable, as there is hardly any variability around 1/(# of back-ends) compared to the ring-based scheme. The only downside of MagLev is that calculating the assignment lookup table is resource intensive and thus takes time. However, since back-ends hardly ever change in our clusters, this is not an issue.

For the interested reader, the paper (“Maglev: A Fast and Reliable Software Network Load Balancer”, NSDI 2016) can be found here.
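For illustration, here is a rough sketch of how a MagLev lookup table is populated, following the algorithm described in the paper; the hash choices, table size and back-end names are assumptions for the sketch, not our implementation:

```python
import hashlib

def _hash(value: str, seed: str) -> int:
    digest = hashlib.md5(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def maglev_table(backends, table_size=65537):
    """Build a MagLev lookup table mapping table slots to back-ends.

    table_size should be a prime much larger than the number of back-ends;
    each back-end ends up owning table_size / len(backends) slots (+/- 1).
    """
    n = len(backends)
    # Each back-end gets its own permutation of the table slots, derived
    # from two hashes of its name (an offset and a skip), as in the paper.
    offsets = [_hash(b, "offset") % table_size for b in backends]
    skips = [_hash(b, "skip") % (table_size - 1) + 1 for b in backends]

    table = [None] * table_size
    next_index = [0] * n
    filled = 0
    while filled < table_size:
        for i in range(n):               # round-robin over the back-ends
            # Walk back-end i's permutation until we find a free slot.
            while True:
                slot = (offsets[i] + next_index[i] * skips[i]) % table_size
                next_index[i] += 1
                if table[slot] is None:
                    break
            table[slot] = backends[i]
            filled += 1
            if filled == table_size:
                break
    return table

def backend_for(key, table):
    return table[_hash(key, "key") % len(table)]

table = maglev_table(["backend-1", "backend-2", "backend-3"])
print(backend_for("user:42", table))
```

Looking up a key is then just one hash plus a table index; only the (infrequent) rebuild of the table when the set of back-ends changes is expensive.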

What changes for our customers?

The most noticeable change is a reduction in early evictions. In a sharded cache, skew in the distribution of keys can result in evictions before the global memory limit is reached. We counteract this by assigning more memory to a cache than its advertised limit, but evictions still start before 100% of the cache is used.

As we switch over from simple consistent hashing to MagLev, there will be a short transition period with a slight performance hit when getting values from the cache. The reason is that changing the consistent hashing scheme changes many of the key-to-back-end assignments. Since this would result in a lot of lost values, we will have a transition phase during which we fall back to the ring-based consistent hashing whenever a key is not found. This will result in a latency increase for get misses, as shown by the red line (the average of the red dots) after the switch to MagLev (black dotted line):

figure_1

The expected latency increase for get misses is between 100µs and 200µs, which is 5-10% of the total request time generally seen by clients. During this transition period (which will last 30 days), cache statistics displayed in the MemCachier dashboard will also be misleading, because every get miss will be counted twice.
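Conceptually, the proxy-side lookup during the transition behaves like the sketch below; the dict-backed “back-ends” and the two chooser functions are stand-ins for illustration only:

```python
# Stand-in back-ends: in reality these are memcached back-end servers.
backends = {name: {} for name in ("backend-1", "backend-2", "backend-3")}

def maglev_backend_for(key: str) -> dict:
    return backends[f"backend-{hash(key) % 3 + 1}"]        # stand-in chooser

def ring_backend_for(key: str) -> dict:
    return backends[f"backend-{hash(key[::-1]) % 3 + 1}"]  # stand-in chooser

def transitional_get(key: str):
    primary = maglev_backend_for(key)
    if key in primary:
        return primary[key]
    # Only on a miss do we pay for the extra lookup on the old ring-based
    # assignment, which is why get misses get slower and are counted twice.
    fallback = ring_backend_for(key)
    return fallback.get(key)
```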

AWS Infrastructure Migration: Results

In this article, we’ll describe some of the performance impacts of the AWS infrastructure migration we performed recently.

State before and after migration

Before the migration, MemCachier clusters mostly used EC2-Classic instances, i.e. instances not in a VPC. The monitor machines (used for measuring cluster latencies) had been rebuilt recently and were already in VPCs.

After the migration, all cluster machines are in VPCs, with one subnet per availability zone, and one security group per cluster. The monitor machines are still in the same VPCs as before. For regions with clusters in multiple AZs, this means that some latency measurements are intra-AZ measurements and some are cross-AZ measurements.

Measuring latencies

We continually (every 10 seconds) measure get, set and proxy latencies to a test cache in each cluster. The latency measurements are done using normal memcached protocol requests from the monitor machine, which is in a different VPC to the cluster EC2 instances. The idea here is to represent a reasonably realistic setup in terms of the distance from the client (monitor) machine to the cluster, replicating what a real customer client would see.

Get, set and proxy latency values are measured independently at each sampling time. (This will be relevant later, when we want to try to remove the latency to the proxy from the get and set values in order to get some idea of the intra-cluster latency.)
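A monitoring loop of this kind might look roughly like the following sketch, assuming the pymemcache client library; the host name is a placeholder, and using the memcached version command as a proxy-only round trip is an assumption made for the sketch, not necessarily how the proxy latency is actually measured:

```python
import time
from pymemcache.client.base import Client  # assumed client library; any memcached client works

client = Client(("mc-test.example.com", 11211), connect_timeout=2, timeout=2)

def timed(op):
    """Return the wall-clock time of a single memcached operation, in microseconds."""
    start = time.perf_counter()
    op()
    return (time.perf_counter() - start) * 1e6

while True:
    set_us = timed(lambda: client.set("latency-probe", b"x"))
    get_us = timed(lambda: client.get("latency-probe"))
    proxy_us = timed(client.version)   # a cheap command answered by the proxy itself
    print(f"set={set_us:.0f}us get={get_us:.0f}us proxy={proxy_us:.0f}us")
    time.sleep(10)                     # one independent sample every 10 seconds
```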

The raw latency data usually looks something like this (from a machine chosen at random):

raw-latency

There’s some natural variability in the raw latency values, and as is usual with these kinds of network propagation time measurements, there is a long tail towards larger latency values (the axis scale here is logarithmic, so the spikes in the latency graph are really quite significant).

We use daily latency averages to visualize longer-term trends. The last few months of data from a machine in the US-East-1 region look like this:

avg-latency-us-east-1

Although averaging hides important features of the latency distribution, there are some interesting long-term secular trends in the averages. For US-East-1, there are four distinct periods visible since the beginning of 2017. We did the VPC migration for US-East-1 at the beginning of June, which explains the change there. The period from mid-February until early March with higher average latencies was a period where we believe the monitor machine for US-East-1 was suffering from a “noisy neighbor” (where another EC2 instance on the same physical server as our monitor instance exhibits intermittently high loads, either on CPU or network communications, interfering with the performance of the monitor process). In early March, we launched a new monitor instance (with a larger instance type), which made this problem go away.  The period between launching this new monitor instance and when we performed the VPC migration has anomalously good latency values, which we think is just down to “luck” in the placement of the new monitor instance in relation to the EC2-Classic instances that were hosting our servers at that time.

Average latencies from a machine in the EU-West-1 region show much less long-term variation, although there are still some secular changes that aren’t associated with anything that we did:

avg-latency-eu-west-1

Empirical latency CDFs

Comparing time series plots of averaged latencies shows some long-term trends, but it hides the interesting and important part of the distribution of latency values. It’s an unfortunate fact of life that the most important part of the distribution of latency values is the hardest to examine. These important values are the rarer large latency values lying in the upper quantiles of the latency distribution. To understand why these are important, consider an application that, in order to service a typical user request, needs to read 50 items from its MemCachier cache (to make this concrete, think of a service like Twitter that caches pre-rendered posts and needs to read a selection of posts to render a user’s timeline). Let’s think about the 99.9% quantile of the latency distribution, i.e. the latency value for which 999/1000 requests get a quicker response. This seems like it should be a pretty rare event, but if we make 50 requests in a row, the probability of getting a response quicker than the 99.9% quantile for all of them is (0.999)^50 ≈ 0.951. This means that about 5% of such sequences of requests will experience at least one response as bad as the 99.9% quantile, and this slowest cache response will delay the final rendering of the response to the user.
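The arithmetic, for reference:

```python
# Probability that all 50 cache reads in a request beat the 99.9% quantile,
# and hence the fraction of requests delayed by at least one slow read.
p_single = 0.999
n_reads = 50
p_all_fast = p_single ** n_reads
print(f"{p_all_fast:.3f}")        # ~0.951
print(f"{1 - p_all_fast:.1%}")    # ~4.9% of requests see at least one slow read
```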

One way of visualizing the whole distribution of latency values is using the empirical cumulative distribution function (CDF). The CDF of a distribution is the function F(x) = P(X ≤ x) that gives the probability that the value of a random variable X sampled from the distribution is no larger than a given value x. To calculate an empirical estimate of the CDF of a distribution from a finite sample from the distribution, we just count the fraction of data values less than or equal to a given value. As usual with this sort of empirical distribution estimation, what you get out is just an approximation to the true CDF: if you have enough data, it’s pretty good in the “middle” of the distribution, but you suffer a bit from sampling error in the tails of the distribution (just because there are naturally fewer data samples in the tails).
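Computing an empirical CDF from a sample is straightforward; here is a minimal sketch using NumPy, with synthetic stand-in data rather than our monitoring data:

```python
import numpy as np

def empirical_cdf(samples):
    """Return (x, F) where F[i] is the fraction of samples <= x[i]."""
    x = np.sort(np.asarray(samples))
    f = np.arange(1, len(x) + 1) / len(x)
    return x, f

# Synthetic stand-in for a few days of get-latency samples, in microseconds.
get_latencies_us = np.random.lognormal(mean=6.0, sigma=0.5, size=100_000)
x, f = empirical_cdf(get_latencies_us)

# The quantities of interest: how many requests are answered quickly, and
# where the upper quantiles of the distribution sit.
print("fraction answered within 1 ms:", np.mean(get_latencies_us <= 1000.0))
print("99.9% quantile (us):", np.quantile(get_latencies_us, 0.999))
```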

The following figures show the empirical CDFs for the get, set and proxy latency for a couple of machines, one in a cluster in US-East-1 and one in EU-West-1. Colours distinguish CDFs for three different 3-day time periods (one each immediately before and after the infrastructure migration and one earlier on, to show the “normal” situation for the US-East-1 region), and line thicknesses distinguish between get, set and proxy values. Latencies are measured in microseconds and are plotted on a logarithmic scale. (The scale is cut off at the right-hand end, excluding the very largest values, since this makes it easier to see what’s going on.)

composite-ecdfs

These two plots are pretty indicative of what’s seen for all of our machines. There’s a split between US-East-1 and the other regions, with latencies in the other regions being consistently lower after the migration than before: in the right-hand plot above, each of the red (after migration) “get”, “set” and “proxy” curves lies to the left (i.e. at lower latency values) of the corresponding blue (before migration) curve. The higher quantiles for these regions are consistently lower after the migration than before.

US-East-1 is a little different. Overall, performance as measured by latency values is comparable after the migration to what we saw in February, but the CDFs from after the migration have an extra kink in them that we can’t explain, as seen in the red curves on the left-hand plot above. We also have the period from early March until the migration where we had very low latency values, almost certainly down to some lucky eventuality of placement of the new monitor machine that we launched. It’s interesting to see that the lowest latency values after the migration match up well with the lowest values seen during the “good” period, but the larger values correspond more closely with what we saw in February. It’s our belief that this is an artefact of the underlying AWS networking infrastructure. We’ll continue to keep an eye on this and might try starting a new monitoring instance to see if that leads to better results (here, “better” mostly means “more understandable”).

The get and set latency values shown in the CDFs above include both the time taken for a memcached protocol message to get to and from the proxy server that the monitor connects to and the time taken for communication between the proxy server and the involved cache backend server. Measuring intra-cluster latencies, i.e. that communication time between proxy and cache backend, is complicated by the independent sampling of set, get and proxy latency at each sample time. (There is obviously some correlation between the different quantities, since the long-term variability seen above in the average latency values appears to occur synchronously in each quantity, but this is swamped by stochastic variability at short time-scales.)

One thing we can do though is to look at differences in the CDFs. We basically just take the difference between (for example) the get latency and the proxy latency for each quantile value and call this the “intra-cluster get latency CDF”. The figure below shows these “difference CDFs” for get and set latency for the same two machines shown in the CDF composite plot above.

ecdf-differences
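For reference, the quantile-by-quantile differencing described above can be sketched like this, again with synthetic stand-in data rather than our actual measurements:

```python
import numpy as np

def difference_cdf(latencies_a, latencies_b, quantiles=np.linspace(0.01, 0.99, 99)):
    """Quantile-by-quantile difference between two independently sampled latency series.

    Subtracting the proxy quantiles from the get quantiles gives a rough
    "intra-cluster get latency" distribution; the two series cannot be
    differenced sample-by-sample because they are measured independently.
    """
    qa = np.quantile(latencies_a, quantiles)
    qb = np.quantile(latencies_b, quantiles)
    return quantiles, qa - qb

# Synthetic stand-in samples, in microseconds.
get_us = np.random.lognormal(6.2, 0.5, 100_000)
proxy_us = np.random.lognormal(6.0, 0.5, 100_000)
q, diff = difference_cdf(get_us, proxy_us)
print("median intra-cluster get latency estimate (us):", diff[len(diff) // 2])
```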

The EU-West-1 results are comparable to what we see for all machines outside US-East-1: intra-cluster get and set latencies are reduced after the migration compared to before. The situation in US-East-1 is more complicated, but the plot here is representative of what we see: intra-cluster get latencies are better than or the same as in the “good” period, while intra-cluster set latencies are less good, but comparable to what we saw in February. This is slightly different to what we saw in the composite CDF plots, where get, set and proxy latencies were all worse after the migration than during the “good” period. We speculate that this is just down to differences in the route taken by memcached requests from the monitor machine to the proxies in US-East-1 now that the proxies are all on machines in a VPC (as opposed to being on EC2-Classic instances). We’re still working on understanding exactly what’s going on here, and will continue to monitor the latency behaviour in US-East-1.

Conclusions

The kind of latency measurements that we use to monitor MemCachier performance are very susceptible to small variations in network infrastructure. This makes them a sensitive measuring tool (indeed, we use rules based on unexpected latency excursions for alerting to catch and fix networking problems early), but it can make them difficult to understand.

It seems that the most significant factor in latency behaviour really is “luck” related to infrastructure placement. This can be seen both from the existence of this “good” period in US-East-1 for which we have no other reasonable explanation, and from the long-term secular changes in latencies that we presume must be associated with changes within the AWS infrastructure (either networking changes, or the addition or removal of instances on the physical machines that our instances are hosted on). This unpredictability is the price you pay for using an IaaS service like AWS.

From the perspective of the infrastructure migration, we are quite satisfied with the performance after migration. All regions except for US-East-1 are clearly better than before, and while the results in US-East-1 are a little confusing, performance is better than it was in February, and from some perspectives is comparable even to the “lucky good period” that we had.

How we live migrated a 150 node cache

In April, we announced our plan to migrate all of our Amazon Web Services (AWS) clusters to new VPC-based infrastructure on Elastic Compute Cloud (EC2). Having wrapped that up this past month, we documented the process of migrating a large multi-tenant, distributed cache service while keeping it online, as well as some lessons learned.


But why Virtual Private Cloud?

Amazon’s VPCs have been the new de facto infrastructure for EC2 instances for a while. Many newer EC2 instance types can only be launched inside a VPC, and VPCs enable some important forward-looking features like IPv6 connectivity. The aim of the migration was to move all of our clusters to the new VPC-based EC2 instances. We wanted to rationalize the instance types that we use, switching over to newer and more performant instance types, as well as reorganize all clusters to use VPCs with fine-grained security group rules.

“Trust the Process”

To explain what we did during the migration, it helps to know how MemCachier uses AWS. User caches are provisioned into a MemCachier cluster, composed of several EC2 instances, each running one or more cache backend servers and a proxy server. Clients connect to a proxy server, which manages the forwarding of memcached protocol requests to backends. The number of cache backends an instance runs depends on the instance type. Clusters vary in size from one or two instances (for development clusters) to 12 instances. We have clusters in each AWS region where we have customers (six regions at the time of writing).

Across our clusters, we have about 70 EC2 instances of different sizes, running about 150 cache backend servers. Within each cluster, user caches are sharded across all backend servers. Restarting a backend server evicts the fraction of each user cache managed by that backend. This means that this kind of large-scale migration has to be done gradually: most user applications can deal with gradual eviction of cache items far more easily than they can with a single “big bang” flush of the entire cache.

Automation and testing

Migrating all instances across all supported AWS regions required more than 300 steps, where a step might be “launch and configure a new instance”, “start a new backend server”, or similar, with each step involving several sub-steps. Doing this manually would have been too error-prone. Instead, automating this process was key to successfully performing these steps in the right order, error-free.

Luckily, we had already automated many of the individual steps involved (e.g. launching new machines, migrating DNS records, adding/removing nodes from a cluster), but we built a new state machine-based migration tool to sequence everything and to combine individual system administration actions into larger logical steps. We tested this tool extensively on our staging environments, then used it to migrate development clusters first, to ensure that it worked correctly, before moving on to production clusters.
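As a purely hypothetical illustration of the state-machine idea (the states, action names and structure here are invented, not our actual tool):

```python
from enum import Enum, auto

class State(Enum):
    PENDING = auto()
    NEW_INSTANCES_LAUNCHED = auto()
    BACKENDS_STARTED = auto()
    TRAFFIC_MOVED = auto()
    OLD_INSTANCES_RETIRED = auto()
    DONE = auto()

# Each transition wraps existing automation (launching machines, migrating
# DNS records, adding/removing nodes) behind one named logical step.
TRANSITIONS = {
    State.PENDING: (State.NEW_INSTANCES_LAUNCHED, "launch_and_configure_instances"),
    State.NEW_INSTANCES_LAUNCHED: (State.BACKENDS_STARTED, "start_backend_servers"),
    State.BACKENDS_STARTED: (State.TRAFFIC_MOVED, "migrate_dns_records"),
    State.TRAFFIC_MOVED: (State.OLD_INSTANCES_RETIRED, "drain_and_terminate_old_instances"),
    State.OLD_INSTANCES_RETIRED: (State.DONE, "verify_cluster_health"),
}

def step(cluster, state, actions):
    """Run one migration step for a cluster and return its new state."""
    next_state, action_name = TRANSITIONS[state]
    actions[action_name](cluster)   # each action bundles several sub-steps
    return next_state
```

Keeping the per-cluster state explicit like this is what made it practical to run many cluster migrations in parallel without losing track of where each one was.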

The result was that we had no problems during the migration that were due to human error, mis-sequencing of migration steps, or missing sub-steps within the migration plan. This was particularly important because we migrated many clusters in parallel, which made managing the overall state of the migration even more involved.

To resolve, or not to resolve?

From what we can tell, there were very few problems caused by the migration: some customers noticed and asked about the partial cache flushes that resulted from backends migrating to new machines, but there appeared to be very little disruption.

However, not everything in the garden was rosy. We had one quite serious problem during the migration. One of our larger customers noticed that they were getting inconsistent values from one of their caches, something that appeared only after we began migrating the cluster their cache is hosted on. We tracked this down to an undocumented behavior of DNS name resolution on EC2 instances (later confirmed by Amazon).

Each EC2 instance within a VPC has both a private IP address, accessible only within the VPC, and a public IP address. Intra-security group rules (i.e. “allow all traffic between instances within this security group”) work by matching the private IP address subnet, so servers within a MemCachier cluster use private IP addresses to talk to one another. An instance’s public DNS record resolves to its private IP address within a VPC since EC2 provides a DNS resolver specific to the VPC. This means that one instance within a VPC can connect to another using its DNS name and intra-security group rules will apply correctly!

The problem we discovered was that the DNS record mapping an instance’s public DNS name to its private IP address can take several minutes to propagate after a new instance is launched. During that period, the new instance’s DNS name resolves to its public IP address from other instances in the VPC. Of course, trying to connect to the new instance via its public IP address wouldn’t work, because our intra-security group rules restrict access to the VPC’s private IP subnet.

As a result, when cache backend servers in the cluster were notified of a new instance and tried to connect to its proxy server, the new instance’s DNS name sometimes resolved to the public IP address and the connection failed. This meant that different proxy processes within the cluster ended up being connected to different numbers of cache backends, which led to inconsistent views of some caches.

Once we had identified the problem, avoiding it for the rest of the migration was easy, since we have a mechanism to force all backends in a cluster to refresh their proxy connections, but it was a big problem for this one customer in particular. We have now also worked around this in our code, so we won’t be hit by this again.
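Conceptually, the code-level workaround amounts to waiting until an instance’s public DNS name resolves to a private address before using it; a hypothetical sketch, not our actual code:

```python
import ipaddress
import socket
import time

def wait_for_private_resolution(hostname, timeout=600, interval=10):
    """Poll until the instance's public DNS name resolves to a private
    (RFC 1918) address inside the VPC, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            addr = ipaddress.ip_address(socket.gethostbyname(hostname))
            if addr.is_private:
                return str(addr)
        except socket.gaierror:
            pass                      # the name may not resolve at all yet
        time.sleep(interval)
    raise TimeoutError(f"{hostname} never resolved to a private address")
```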

Lessons learned

There were three main lessons we drew from all this:

  1. Scheduling: We did the migration over the space of a couple of weeks that spanned the end of the month, which was a mistake. Some customers had end-of-month processes to run and it would have been better to avoid the few days either side of the end of the month to ensure that there was no migration-related performance degradation for them.
  2. Announcements: We announced the migration in a blog post and on Twitter, and we contacted all of our larger customers by email, but those emails could have been more explicit about asking customers to let us know about any special jobs they needed to run during the migration period.
  3. Extra cluster diagnostics: The biggest problem we had (the DNS name resolution issue) could have been detected and alerted with better cluster diagnostics. We’ve had an issue open for a while to add some of those things, and it will definitely be done before we do any more large infrastructure changes.


MemCachier Launches on Manifold

We’ve recently launched MemCachier on Manifold, a new company that aims to bring the developer app store to everyone rather than having it tied to a single specific PaaS player.

We are really excited by this partnership with Manifold and the mission they are on. Please check them out and give MemCachier a spin through them.

AWS Infrastructure Migration

The AWS infrastructure that MemCachier uses for direct sign-up and Heroku customers evolves over time. Amazon releases new EC2 instance types and new network infrastructure features on a regular basis. In order to exploit these new features, MemCachier occasionally needs to migrate caches to new infrastructure. We are planning to migrate all of our clusters to new AWS infrastructure over the course of the next few weeks, starting with development caches and production caches in less-used AWS regions next week. There is likely to be some limited impact on cache performance during the migration process, particularly as we retire existing EC2 instances in favor of new ones.

This migration should generally improve the performance of our cache backends by switching to a newer generation of EC2 instance types, and it will also allow us to lock down security within our infrastructure in a more fine-grained way by switching all of our clusters over to using AWS VPC networking.

Please direct any questions about the migration process to support@memcachier.com. We will be reaching out to customers with larger caches and more exacting utilization patterns individually in the next few days.