MagLev, our new consistent hashing scheme

A key feature that MemCachier offers its clients is the ability to use multiple proxy servers to access their cache. This is important because most issues that clients experience are related to network delays or general connectivity. In such cases a memcached client can fall back to a different proxy and still have access to the entire cache.
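
As a rough illustration (placeholder hostnames and credentials, not taken from our documentation), a Python client such as python-binary-memcached can be configured with more than one proxy endpoint:

```python
import bmemcached

# Hypothetical proxy endpoints and credentials; use the values from
# your own MemCachier configuration instead.
servers = [
    "mc1.example.memcachier.com:11211",
    "mc2.example.memcachier.com:11211",
]
client = bmemcached.Client(servers, username="user", password="secret")

# Every proxy can serve the entire cache, so if one of them becomes
# unreachable the client can simply fall back to the other.
client.set("greeting", "hello")
print(client.get("greeting"))
```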

To implement this feature MemCachier uses consistent hashing. Just like regular hashing, this assigns every key to a specific back-end, but it has a crucial advantage: if the number of back-ends changes, only a small fraction of the keys needs to be reassigned. For example, if the number of back-ends increases from 3 to 4, a naive hashing scheme would assign 75% of the keys to a different back-end after the change. In such a case, a good consistent hashing scheme only changes the assignments of 25% of the keys.
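
To make those numbers concrete, here is a small, self-contained sketch (illustrative only, not our production code) that counts how many keys a naive modulo scheme would reassign when a fourth back-end is added:

```python
import hashlib

def h(key: str) -> int:
    """Stable 64-bit hash of a key."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

keys = [f"key-{i}" for i in range(100_000)]

# Naive scheme: back-end index = hash(key) mod (number of back-ends).
moved = sum(1 for k in keys if h(k) % 3 != h(k) % 4)
print(f"naive hashing: {moved / len(keys):.1%} of keys reassigned")
# Prints roughly 75%. A good consistent hashing scheme only moves the
# ~25% of keys that now belong to the newly added back-end.
```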

What’s wrong with our current scheme?

Until now we have used the most basic consistent hashing scheme, which is based on a ring. Each back-end is assigned a point on the ring (based on the hash of an ID used to identify the back-end) and is in charge of all values on the ring from that point up to the point that belongs to the next back-end. As an example, consider a ring with 100 points and 3 back-ends. The ID of each back-end is hashed to a point on the ring, i.e., to a point in the interval [1, 100]. Ideally we’d like the back-ends to be equally spaced around the ring, e.g., at 33, 66, and 99, which would assign all keys that map to points between 33 and 65 to the back-end at 33, and so on. In practice, however, the size of the segment each back-end is responsible for can vary quite a lot, which results in skew: if the 3 back-end IDs hashed to 12, 17, and 51, for example, the assignment of keys to back-ends would be quite uneven.
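
A minimal sketch of such a hash ring (with made-up back-end names, and without the multiple points per back-end discussed below) could look like this:

```python
import bisect
import hashlib

def h(value: str) -> int:
    """Map a string to a point on a ring with 2**64 points."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, backends):
        # Place each back-end on the ring at the hash of its ID.
        placed = sorted((h(b), b) for b in backends)
        self.points = [p for p, _ in placed]
        self.backends = [b for _, b in placed]

    def lookup(self, key: str) -> str:
        # A key belongs to the back-end whose point is the closest one
        # at or before the key's point; index -1 wraps around the ring.
        i = bisect.bisect_right(self.points, h(key)) - 1
        return self.backends[i]

ring = HashRing(["backend-a", "backend-b", "backend-c"])
print(ring.lookup("some-key"))
```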

When simulating an example architecture with 9 back-ends and a realistically sized ring (2^64 points) over 100 rounds, each time ordering the back-ends by their share of the ring, the distribution of the percentage of the ring each back-end gets looks like this:

[figure_11: distribution of ring shares with a single point per back-end]

While one back-end gets around 30% of all keys, others are assigned hardly any keys. This effect is generally dampened by assigning each back-end to multiple points on the ring in order to statistically even out the skew. For 100 points per back-end the situation looks like this:

[figure_1001: distribution of ring shares with 100 points per back-end]

This looks much better, but if we zoom in and look at the skew in more detail, we see how bad it still is:

[figure_1001_zoom: zoomed view of the ring shares with 100 points per back-end]

The median (the red center line in the boxes) of the most loaded back-end is 30% above that of the least loaded back-end. It is not unlikely (within the 95th percentile interval, the whiskers around the boxes) for the most loaded back-end to get 14.4% of the keys while the least loaded one gets only 8.3%. That is nearly a factor of two skew between the most and least loaded back-end!
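
The numbers above come from simulations along the lines of the following sketch (simplified, not the exact script we used), which places each of 9 back-ends at several random points on a 2^64 ring and measures how much of the ring each back-end owns:

```python
import random

RING = 2 ** 64

def shares(num_backends=9, points_per_backend=100):
    """Sorted fractions of the ring owned by each back-end for one random placement."""
    points = sorted(
        (random.randrange(RING), b)
        for b in range(num_backends)
        for _ in range(points_per_backend)
    )
    owned = [0] * num_backends
    for (p, b), (nxt, _) in zip(points, points[1:] + [(points[0][0] + RING, -1)]):
        owned[b] += nxt - p   # each point owns the segment up to the next point
    return sorted(o / RING for o in owned)

rounds = [shares() for _ in range(100)]
least = sorted(r[0] for r in rounds)
most = sorted(r[-1] for r in rounds)
print("least loaded back-end, median share:", least[50])
print("most loaded back-end, median share:", most[50])
# With points_per_backend=1 the spread is dramatic; even with 100 points
# per back-end a clear gap between the most and least loaded remains.
```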

Enter MagLev

MagLev is a consistent hashing algorithm developed at Google and published in 2016. Its main advantage is that it has no skew at all. It achieves this with a deterministic algorithm that builds a lookup table for keys in a round robin fashion, guaranteeing that each back-end gets the same number of keys. What about the remapping of keys when a back-end changes? In theory, it is basically the same as for ring based consistent hashing, that is 1/(# of back-ends), or 25% when going from 3 to 4 back-ends as in the example above. In practice it is more predictable, as there is hardly any variability around 1/(# of back-ends) compared to the ring based scheme. The only downside of MagLev is that calculating the assignment lookup table is resource intensive and thus takes time. However, since back-ends hardly ever change in our clusters, this is not an issue.
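
For the curious, here is a condensed sketch of the table construction, following the population algorithm described in the paper (the hash functions and table size are our own choices for illustration):

```python
import hashlib

def _h(value: str, seed: str) -> int:
    return int.from_bytes(hashlib.sha256((seed + value).encode()).digest()[:8], "big")

def maglev_table(backends, m=65537):
    """Build a lookup table of size m (a prime, as the paper requires).

    Each back-end fills slots in round robin order following its own
    preference permutation, so every back-end ends up with almost
    exactly m / len(backends) slots and there is no skew.
    """
    n = len(backends)
    offsets = [_h(b, "offset") % m for b in backends]
    skips = [_h(b, "skip") % (m - 1) + 1 for b in backends]

    table = [-1] * m          # table[slot] = index of the owning back-end
    next_pref = [0] * n       # position in each back-end's preference list
    filled = 0
    while True:
        for i in range(n):    # round robin over the back-ends
            # Walk back-end i's preference list until a free slot is found.
            c = (offsets[i] + next_pref[i] * skips[i]) % m
            while table[c] >= 0:
                next_pref[i] += 1
                c = (offsets[i] + next_pref[i] * skips[i]) % m
            table[c] = i
            next_pref[i] += 1
            filled += 1
            if filled == m:
                return table

# A key is then served by the back-end that table[hash(key) % m] points at.
table = maglev_table(["backend-a", "backend-b", "backend-c"])
print("some-key ->", table[_h("some-key", "key") % len(table)])
```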

For the interested reader, the algorithm is described in Google’s paper “Maglev: A Fast and Reliable Software Network Load Balancer” (NSDI 2016).

What changes for our customers?

The most noticeable change is a reduction in early evictions. In a sharded cache, skew in the distribution of keys can result in evictions before the global memory limit is reached. We counteract this by assigning more memory to a cache than its advertised limit, but evictions can still start before 100% of the cache is used.

As we switch over from simple consistent hashing to MagLev, there will be a short transition period with a slight performance hit when getting values from the cache. The reason is that changing the consistent hashing scheme changes many of the key-to-back-end assignments. Since this would result in a lot of lost values, we will have a transition phase during which we fall back to the ring based consistent hashing whenever a key is not found. This will result in a latency increase for get misses, as shown by the red line (the average of the red dots) after the switch to MagLev (black dotted line):

[figure_1: get-miss latency before and after the switch to MagLev]

The expected latency increase for get misses is between 100µs and 200µs, which is 5-10% of the total request time generally seen by clients. During this transition period (which will last 30 days), the cache statistics displayed in the MemCachier dashboard will also be misleading, because every get miss will be counted twice.
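
Conceptually, the lookup during the transition behaves like this simplified sketch (function names are illustrative, not our proxy code):

```python
def get_during_transition(key, backend_for_maglev, backend_for_ring):
    """backend_for_* map a key to a memcached connection for the
    back-end chosen by the respective hashing scheme."""
    value = backend_for_maglev(key).get(key)
    if value is not None:
        return value
    # On a miss, also ask the back-end the old ring scheme would have
    # chosen, so values stored before the switch are still found. This
    # extra hop is where the additional 100-200 microseconds of get-miss
    # latency comes from, and it is also why every miss is counted twice
    # in the dashboard statistics during the transition.
    return backend_for_ring(key).get(key)
```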

Revamped Status Page

We are happy to announce some wide-ranging improvements to our status page. While it looks the same on the surface, it has been rewritten from scratch and is now much more tightly integrated with our monitoring infrastructure. For our customers, this has three main consequences.

Fully automated status updates

Status changes are now fully automated. Over the last few months we have carefully tuned the sensitivity of the status page to the point where we are now happy with its accuracy and responsiveness. While we continue to improve the algorithms that monitor our infrastructure, the status page reflects the health of our infrastructure better than ever.

Better differentiated status

We now differentiate between increased latency and unreachability in the status of the proxies in our clusters. For convenience we use a simple traffic light approach: green means all proxies in a cluster are up and responding quickly enough, yellow means one or more proxies in a cluster are responding more slowly than usual, and red means that our monitoring infrastructure is unable to reach a proxy at all.

“Recently Resolved”

Previous support incidents have revealed that customers often visit our status page after an incident has already been resolved. They are then puzzled because all our clusters show a green status even though they have clearly just had trouble connecting to our servers. To avoid this confusion, we now flag such clusters as “Recently Resolved” and highlight them in blue.

All these changes are also reflected in the status RSS feed. Please let us know if you think these changes are useful or what other features you would like to see next. We have an ongoing program of improvements in system monitoring and fault diagnosis, but we’re always very happy to hear ideas from customers that would make their experience using MemCachier better!

Flush Command Logging for Heroku

We are happy to announce a new feature for our Heroku customers. In the past we have had several requests from customers who wanted to know why their caches had been flushed. To help our clients find out how a flush command came about, we now push a log message to the Heroku log whenever a cache is flushed.

The log message contains the hostname of the proxy the flush command was executed on as well as its origin. The origin can be one of the following:

  • memcache client: The flush command originates from a normal memcached client.
  • web interface: This happens when a client clicks “Flush Cache” on the analytics dashboard.
  • admin operation: MemCachier flushed the cache on your behalf. This can occur when you perform operations like switching from one cluster to another.
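
As an illustration of the first case, a flush issued from an ordinary memcached client (placeholder endpoint and credentials; python-binary-memcached shown as an example library) will now show up in your Heroku log, which you can follow with heroku logs --tail:

```python
import bmemcached

# Placeholder values; use the MEMCACHIER_* config vars of your app.
client = bmemcached.Client(["mc1.example.memcachier.com:11211"],
                           username="user", password="secret")

# This flush should produce a log message listing the proxy's hostname
# and the origin "memcache client".
client.flush_all()
```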

We hope you find it useful! We’ll be looking to add more information to the Heroku log in the future and would love suggestions and feedback on this.

New Analytics Dashboard and Cache Migration

We are happy to announce a new design for the analytics dashboard and the ability to move your cache from one cluster to another in emergencies.

[Screenshot: the redesigned analytics dashboard]

In August, we had the biggest incident in MemCachier’s history, which affected two clusters on Amazon in the US-East-1 region simultaneously. During our post-mortem analysis we realized that, in extreme cases like these, a significant number of customers would prefer to move to a different cluster even if it means losing the contents of their cache.

As part of a wide range of improvements to better handle such extreme outages in the future, we have built, and decided to expose to customers, the ability to move their cache to a different cluster.

The feature can be found in the redesigned analytics dashboard and is available to all paying customers. When used, the cache will be moved to the least loaded cluster in the same region and data in the old cluster will be flushed. The transition is seamless, meaning the configuration of the memcache client does not change. However, the DNS record change might take up to 3 minutes to propagate. It is important to note that this feature is meant to be used only in emergencies! Please always refer to http://status.memcachier.com to see if the cluster you are on is experiencing problems.

As part of this change, we are also altering how we create DNS records for customers. While you used to be able to tell which cluster you were on from the DNS record, the analytics dashboard now displays this information.

Besides serving as a last resort during downtime, this feature is also useful for creating a second cache and ensuring that your two caches are on independent clusters. If the two caches happen to land on the same cluster, simply move one of them!

Rest assured that we’ve improved and continue to improve our ability to handle extreme outages without your involvement or loss of any cached data. But, it’s always nice to have a last resort.