Memcached DDoS and MemCachier

In light of recent DDoS attacks involving memcached servers several customers have asked us about MemCachier’s vulnerability to this type of attack. At MemCachier, we are not vulnerable to these kinds of attacks because we use a custom multi-tenant implementation of the memcached protocol that is optimized for security, reliability, and performance as a cloud service.

How does it work?

This particular attack was possible because the standard memcached implementation allows UDP requests. An attacker who can spoof the victim's IP address can send requests on the victim's behalf, and the victim then receives all the responses. In particular, an attacker would first store large values (up to 1 MB) on the memcached server and then issue many requests on behalf of the victim to fetch these large values, amplifying a small spoofed request into a large response. This blog post explains the attack in more detail.

What about your memcached server?

If you are running a memcached version older than 1.5.6 and have not explicitly turned off the UDP port, you are vulnerable.
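If you want to test this yourself, you can probe the UDP port directly. Here is a minimal sketch (not an official tool) that sends a memcached `stats` request wrapped in the protocol's 8-byte UDP frame header (request ID, sequence number, total datagram count, reserved); if you get a reply, the UDP port is open:

```python
import socket
import struct

def udp_stats_probe(host, port=11211, timeout=2.0):
    """Send a memcached 'stats' request over UDP and return the raw reply,
    or None if the server does not answer (UDP likely disabled)."""
    # memcached UDP frame header: request id, sequence number,
    # total number of datagrams, reserved (four big-endian 16-bit fields)
    header = struct.pack("!HHHH", 0, 0, 1, 0)
    packet = header + b"stats\r\n"
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(packet, (host, port))
        data, _ = sock.recvfrom(4096)
        return data
    except socket.timeout:
        return None
    finally:
        sock.close()
```

If `udp_stats_probe("your-server.example.com")` returns data, your server will happily act as an amplifier and you should disable UDP as described below.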

To prevent this you need to disable the UDP port of your memcached server. To do so, edit the systemd service file with systemctl edit memcached.service --full and make sure memcached is started in ExecStart with the -U 0 flag. Additionally, if your memcached server is only used by a local process, make sure the -l flag is set to a local address as well so memcached only listens on the loopback interface. The whole line should look something like this:

ExecStart=/usr/bin/memcached -l -U 0 -o modern

After editing the service file you need to restart the memcached server with

sudo systemctl restart memcached

If you are using upstart or some other init system instead of systemd, check out this guide for how to configure memcached.

In addition to closing the hole on the memcached server itself, you should also revise your firewall policy. Random UDP ports (and not just UDP ports, for that matter) should not be open to the public.

Why are MemCachier’s servers safe?

MemCachier’s memcached implementation does not listen for UDP requests, so it is not vulnerable to this form of attack. In addition, MemCachier’s servers only process requests from authenticated clients, so a random attacker cannot connect to our servers without valid credentials.

Attacks like this highlight the benefit of using a managed memcached service like MemCachier, because customers don’t need to worry about vulnerabilities or misconfiguration. At MemCachier we have not had a known security vulnerability in the six years we have been operating. Of course we are not totally immune to exploits, but should an issue ever arise, we have a professional team that can quickly deal with such annoyances so our customers don’t have to.

Announcement: cluster merging

As part of the infrastructure improvements that we started in June, we’re going to be merging some of the clusters of EC2 instances that host MemCachier caches, starting in the next couple of weeks.  This will not affect all customer caches, and the impact during the merge process will be minimal, but we wanted to be up-front about the fact that this is happening.

The main purpose of this activity is to take advantage of some of the newer instance types offered by AWS, and in particular to improve the networking performance of the instances we use.  Merging clusters also gives us the opportunity to launch new sets of instances for the new merged clusters, which means we can set these instances up in placement groups, which provide additional benefits in terms of network latency and throughput.

The only negative impact that customers may notice during the cluster merge process is that small fractions of their caches’ keyspace will be evicted as storage backends are migrated into the new merged cluster.  We’ll follow the same kind of procedure for this that we’ve used for earlier migrations, where we move backends over the course of a couple of weeks.  This has been successful in the past and hasn’t led to any noticeable performance degradation for users.

The positive impact of these changes is that all user caches in the merged clusters should exhibit lower latency once the merge is complete!

If you have any questions about this process, as usual please contact us at

New API for the Analytics Dashboard Features

This week we exposed the API feeding our analytics dashboard so you can hook into some of its more convenient features. With our new public API, you can retrieve statistics and historical data to analyze your cache’s activity. You’ll be able to add, edit, promote, and delete credentials. You’ll also have the option to trigger a cache flush command via the API. This means you can now integrate your cache into other processes, such as setting up an HTTP deploy hook through Heroku. Let’s walk through what that would look like.

Say you’ve got a staging environment on Heroku. Due to the temporary nature of staging environments, test data has a tendency to become stale quickly. Chances are you’ll end up with a bunch of leftover cruft in your cache that you don’t need, don’t want, and that makes it difficult to effectively test your site. No problem. We’ll set up an HTTP deploy hook that will flush your cache automatically with each deploy so that your cache is clean and ready to go.

The first thing you’ll need is a set of MemCachier credentials that have API capabilities. On the analytics dashboard for your cache, you’ll notice an additional checkbox next to each set of credentials labeled “API”. This tells you whether or not that particular set of credentials is allowed to operate the API.


We added an API capability to the already existing credential options.

Now that we have the necessary authorization information, we’ll find your MemCachier API ID. (Your MemCachier API ID is not the same as the number listed under the settings section of the analytics dashboard.) You’ll need to retrieve it via the API:

$ curl "" \
> --user "9D8D3F:DEDF58E7704AC284E5B92F4B87B8A2DB"

Your payload should look something like this:

{
  "cache_id": 123456
}

Now we can access all of the fun stuff the API has to offer. For kicks, let’s check out the list of credentials available for this cache:

$ curl "" \
> --user "9D8D3F:DEDF58E7704AC284E5B92F4B87B8A2DB"


[
  {
    "uuid": null,
    "flush_capability": true,
    "cache_id": 253499,
    "sasl_username": "9D8D3F",
    "api_capability": true,
    "write_capability": true,
    "id": 247459
  },
  {
    "uuid": null,
    "flush_capability": true,
    "cache_id": 253499,
    "sasl_username": "AA5930",
    "api_capability": true,
    "write_capability": true,
    "id": 248867
  }
]

Alright, now back to the task at hand. For our deploy hook to successfully run a flush command via the API, we’ll need a set of credentials that has both API and flush capabilities. Once you have selected a set of credentials, we’ll authenticate the API call by including the credential username and password at the beginning of the HTTP URL. You can find more information about setting up a Heroku deploy hook in the Heroku Dev Center.

$ heroku addons:create deployhooks:http \
> --url="https://9D8D3F:DEDF58E7704AC284E5B92F4B87B8A2DB@\

And that’s it! Pretty simple. Now, any time you deploy a new version to Heroku, your cache will automatically flush all of that unnecessary old data away, allowing you to run tests on data that’s new and relevant. Just keep in mind that if you ever update or delete that set of credentials, you’ll need to manually update your deploy hook as well.
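If you’d rather trigger the flush from a script (say, a CI pipeline) instead of a Heroku deploy hook, the same API call can be made from code. The sketch below builds an authenticated request with Python’s standard library; note that the base URL and the `/caches/<id>/flush` path are placeholders for illustration, since the real endpoint URLs are documented in our API docs:

```python
import base64
import urllib.request

# Hypothetical base URL for illustration; see the MemCachier API
# documentation for the real endpoint.
API_BASE = "https://analytics.memcachier.com/api/v1"

def flush_request(cache_id, username, password):
    """Build an authenticated POST request that flushes the given cache.

    The path below is an assumption for illustration, not the
    documented endpoint."""
    url = "%s/caches/%d/flush" % (API_BASE, cache_id)
    # HTTP basic auth: base64-encoded "username:password"
    token = base64.b64encode(("%s:%s" % (username, password)).encode()).decode()
    return urllib.request.Request(
        url, method="POST", headers={"Authorization": "Basic " + token})

# To actually trigger the flush (requires valid credentials):
#   urllib.request.urlopen(flush_request(123456, "9D8D3F", "SECRET"))
```

The same pattern works for any of the other API calls, such as listing credentials, by changing the path and method.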

If you’re curious about all the other functionality our API has to offer, you can find the full documentation for it on our documentation page.

MagLev, our new consistent hashing scheme

A key feature that MemCachier offers its clients is the ability to use multiple proxy servers to access their cache. This is important because most issues that clients experience are related to network delays or general connectivity. In such cases a memcached client can fall back to a different proxy and still have access to the entire cache.

To implement this feature MemCachier uses consistent hashing. Just like regular hashing, this assigns every key to a specific back-end, but it has a crucial advantage: if the number of back-ends changes, only very few keys need to be reassigned. For example, if the number of back-ends increases from 3 to 4, a naive hashing scheme would assign 75% of the keys to a different back-end after the change. In such a case, a good consistent hashing scheme only changes the assignments of 25% of the keys.
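To see why naive (non-consistent) hashing behaves so badly, here is a small simulation sketch. The hash function and key names are arbitrary choices for illustration; with `hash % N` assignment, going from 3 to 4 back-ends remaps about 75% of keys:

```python
import hashlib

def bucket(key, n):
    """Assign a key to one of n back-ends with naive modulo hashing."""
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n

keys = ["key-%d" % i for i in range(100000)]
# Count keys whose back-end changes when growing from 3 to 4 back-ends.
moved = sum(1 for k in keys if bucket(k, 3) != bucket(k, 4))
print("fraction remapped: %.3f" % (moved / len(keys)))  # ~0.75
```

A good consistent hashing scheme brings that fraction down to roughly 1/(new number of back-ends), i.e. about 25% in this example.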

What’s wrong with our current scheme?

Until now we have used the most basic consistent hashing scheme, which is based on a ring. Each back-end is assigned a point on the ring (based on the hash of an ID used to identify the back-end) and is in charge of all values on the ring up to the point that belongs to the next back-end. As an example, consider a ring with 100 points and 3 back-ends. The ID of each back-end is hashed to a point on the ring, i.e. a point in the interval [1, 100]. Ideally we’d like the back-ends to be equally spaced around the ring, e.g. at 33, 66, and 99, which would assign all keys that map to points between 33 and 65 to the back-end at 33, and so on. However, the size of the segment each back-end is responsible for can vary quite a lot, which results in skew: if the 3 back-end IDs hashed to 12, 17, and 51, for example, the assignment of keys to back-ends would be quite uneven.
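A minimal sketch of such a ring (illustrative only, not our actual implementation) might look like this, including the standard trick of giving each back-end multiple virtual points:

```python
import bisect
import hashlib

def _point(s):
    """Hash a string to a point on the ring [0, 2**64)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class Ring:
    """Minimal ring-based consistent hashing with virtual nodes."""

    def __init__(self, backends, vnodes=100):
        # Each back-end owns `vnodes` points on the ring.
        self._ring = sorted(
            (_point("%s#%d" % (b, i)), b)
            for b in backends for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._points, _point(key)) % len(self._ring)
        return self._ring[i][1]
```

When a back-end is added, only the keys falling into the ring segments claimed by the new back-end's points move; everything else stays put.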

When simulating an example architecture with 9 back-ends and a realistically sized ring (2^64 points) over 100 rounds, each time ordering the back-ends by their share of the ring, the distribution of the percentage of the ring each back-end gets looks like this:

[Figure: ring share per back-end, with 1 point per back-end]

While one back-end gets around 30% of all keys, others are assigned hardly any keys. This effect is generally dampened by assigning each back-end to multiple points on the ring in order to statistically even out the skew. For 100 points per back-end the situation looks like this:

[Figure: ring share per back-end, with 100 points per back-end]

This looks much better, but if we zoom in and look at the skew in more detail, we see how bad it still is:

[Figure: zoomed view of the ring share distribution with 100 points per back-end]

We can see that the median (the red center line in the boxes) of the most loaded back-end is 30% above the least loaded back-end. It is not unlikely (within the 95th percentile interval, the whiskers around the boxes) for the most loaded back-end to get 14.4% of the keys while the least loaded one only gets 8.3%. That is nearly a factor of two of skew between the most and least loaded back-end!

Enter MagLev

MagLev is a consistent hashing algorithm developed at Google and published in 2016. Its main advantage is that it has no skew at all. It achieves this with a deterministic algorithm that builds a lookup table for keys in a round-robin fashion, guaranteeing that each back-end gets the same number of keys. What about the remapping of keys when a back-end changes? In theory, it is basically the same as ring-based consistent hashing, that is 1/(# of back-ends), or 25% when going from 3 to 4 back-ends as in the example above. In practice it is more predictable, as there is hardly any variability around 1/(# of back-ends) compared to the ring-based scheme. The only downside of MagLev is that the calculation of the assignment lookup table is resource intensive and thus takes time. However, since back-ends hardly ever change in our clusters, this is not an issue.
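For illustration, here is a simplified sketch of the table-building algorithm from the paper (our production implementation differs in details). Each back-end gets a pseudo-random permutation of the table slots, built from an offset and a skip; the table is then filled by letting the back-ends claim their next preferred free slot in round-robin order, which is what guarantees the even split:

```python
import hashlib

def _h(s, seed):
    """Deterministic 64-bit hash of a string, parameterized by a seed."""
    data = ("%d:%s" % (seed, s)).encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def maglev_table(backends, m=65537):
    """Build a MagLev lookup table of prime size m, mapping each table
    slot to a back-end index.  Simplified sketch of the published
    algorithm."""
    n = len(backends)
    # Per-back-end permutation of slots: slot_j = (offset + j*skip) mod m.
    # m is prime, so any skip in [1, m-1] yields a full permutation.
    offset = [_h(b, 0) % m for b in backends]
    skip = [_h(b, 1) % (m - 1) + 1 for b in backends]
    nxt = [0] * n          # next position in each back-end's permutation
    table = [-1] * m
    filled = 0
    while True:
        for i in range(n):  # round robin over back-ends: no skew
            c = (offset[i] + nxt[i] * skip[i]) % m
            while table[c] >= 0:       # skip slots already taken
                nxt[i] += 1
                c = (offset[i] + nxt[i] * skip[i]) % m
            table[c] = i
            nxt[i] += 1
            filled += 1
            if filled == m:
                return table

def lookup(table, backends, key):
    """Map a key to a back-end via the lookup table."""
    return backends[table[_h(key, 2) % len(table)]]
```

Because the back-ends take turns claiming slots, the per-back-end slot counts differ by at most one, which is exactly the "no skew" property described above.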

For the interested reader the paper can be found here.

What changes for our customers?

The most noticeable change is a reduction in early evictions. In a sharded cache, skew in the distribution of keys can result in evictions before the global memory limit is reached. We counteract this by assigning more memory to a cache than its advertised limit but evictions still start before 100% of the cache is used.

As we switch over from simple consistent hashing to MagLev, there will be a short transition period with a slight performance hit when getting values from the cache. The reason is that changing the consistent hashing scheme will change many assignments of keys to back-ends. Since this would result in a lot of lost values, we will have a transition phase where we fall back to the ring-based consistent hashing whenever a key is not found. This will result in a latency increase for get misses, as shown by the red line (average of the red dots) after the switch to MagLev (black dotted line):

[Figure: get-miss latency over time, with the switch to MagLev marked by a black dotted line]

The expected latency increase for get misses is between 100µs and 200µs, which is 5-10% of the total request time generally seen by clients. During this transition period (which will last 30 days), cache statistics displayed in the MemCachier dashboard will also be misleading, because every get miss will be counted twice.
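The fallback logic during the transition can be sketched as follows. The function and the dict-like back-end interface are hypothetical stand-ins for illustration, not our server code:

```python
def get_with_fallback(key, maglev_backend, ring_backend):
    """Transition-period read: try the MagLev-assigned back-end first
    and, on a miss, retry the old ring-assigned back-end.

    `maglev_backend` and `ring_backend` are hypothetical functions
    mapping a key to a back-end; each back-end exposes a dict-like
    .get() here for illustration."""
    value = maglev_backend(key).get(key)
    if value is None and ring_backend(key) is not maglev_backend(key):
        # The extra hop is the source of the added get-miss latency.
        value = ring_backend(key).get(key)
    return value
```

Once the transition period ends, the fallback lookup is dropped and misses cost a single round trip again.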

AWS Infrastructure Migration: Results

In this article, we’ll describe some of the performance impacts of the AWS infrastructure migration we performed recently.

State before and after migration

Before the migration, MemCachier clusters mostly used EC2-Classic instances, i.e. instances not in a VPC. The monitor machines (used for measuring cluster latencies) were rebuilt recently, and were in VPCs.

After the migration, all cluster machines are in VPCs, with one subnet per availability zone, and one security group per cluster. The monitor machines are still in the same VPCs as before. For regions with clusters in multiple AZs, this means that some latency measurements are intra-AZ measurements and some are cross-AZ measurements.

Measuring latencies

We continually (every 10 seconds) measure get, set and proxy latencies to a test cache in each cluster. The latency measurements are done using normal memcached protocol requests from the monitor machine, which is in a different VPC to the cluster EC2 instances. The idea here is to represent a reasonably realistic setup in terms of the distance from the client (monitor) machine to the cluster, replicating what a real customer client would see.

Get, set and proxy latency values are measured independently at each sampling time. (This will be relevant later, when we want to try to remove the latency to the proxy from the get and set values in order to get some idea of the intra-cluster latency.)

The raw latency data usually looks something like this (from a machine chosen at random):


There’s some natural variability in the raw latency values, and as is usual with these kinds of network propagation time measurements, there is a long tail towards larger latency values (the axis scale here is logarithmic, so the spikes in the latency graph are really quite significant).

We use daily latency averages to visualize longer-term trends. The last few months of data from a machine in the US-East-1 region look like this:


Although averaging hides important features of the latency distribution, there are some interesting long-term secular trends in the averages. For US-East-1, there are four distinct periods visible since the beginning of 2017. We did the VPC migration for US-East-1 at the beginning of June, which explains the change there. The period from mid-February until early March with higher average latencies was a period where we believe the monitor machine for US-East-1 was suffering from a “noisy neighbor” (where another EC2 instance on the same physical server as our monitor instance exhibits intermittently high loads, either on CPU or network communications, interfering with the performance of the monitor process). In early March, we launched a new monitor instance (with a larger instance type), which made this problem go away.  The period between launching this new monitor instance and when we performed the VPC migration has anomalously good latency values, which we think is just down to “luck” in the placement of the new monitor instance in relation to the EC2-Classic instances that were hosting our servers at that time.

Average latencies from a machine in the EU-West-1 region show much less long-term variation, although there are still some secular changes that aren’t associated with anything that we did:


Empirical latency CDFs

Comparing time series plots of averaged latencies shows some long-term trends, but it hides the interesting and important part of the distribution of latency values. It’s an unfortunate fact of life that the most important part of the distribution of latency values is the hardest to examine. These important values are the rarer large latency values lying in the upper quantiles of the latency distribution. To understand why these are important, consider an application that, in order to service a typical user request, needs to read 50 items from its MemCachier cache (to make this concrete, think of a service like Twitter that caches pre-rendered posts and needs to read a selection of posts to render a user’s timeline).  Let’s think about the 99.9% quantile of the latency distribution, i.e. the latency value for which 999/1000 requests get a quicker response. This seems like it should be a pretty rare event, but if we make 50 requests in a row, the probability of getting a response quicker than the 99.9% quantile for all of them is (0.999)^50 ≈ 0.951. This means that about 5% of such sequences of requests will experience at least one response as bad as the 99.9% quantile, and this slowest cache response will delay the final rendering of the response to the user.
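The arithmetic is easy to check:

```python
p_fast = 0.999     # per-request probability of beating the 99.9% quantile
requests = 50      # cache reads needed to render one page

p_all_fast = p_fast ** requests
print("P(all 50 requests fast)  = %.3f" % p_all_fast)        # ~0.951
print("P(at least one slow one) = %.3f" % (1 - p_all_fast))  # ~0.049
```

In other words, a "1 in 1000" slow response ends up affecting roughly 1 in 20 page renders.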

One way of visualizing the whole distribution of latency values is using the empirical cumulative distribution function (CDF). The CDF of a distribution is the function F(x) = P(X ≤ x) that gives the probability that the value of a random variable X sampled from the distribution is at most a given value x. To calculate an empirical estimate of the CDF of a distribution from a finite sample, we just count the fraction of data values no larger than a given value. As usual with this sort of empirical distribution estimation, what you get out is just an approximation to the true CDF: if you have enough data, it’s pretty good in the “middle” of the distribution, but you suffer a bit from sampling error in the tails of the distribution (just because there are naturally fewer data samples in the tails).
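A sketch of such an empirical estimate, together with a simple rank-based quantile read-off:

```python
import bisect

def ecdf(samples):
    """Empirical CDF estimate: F(x) = fraction of samples <= x."""
    data = sorted(samples)
    n = len(data)
    # bisect_right gives the count of samples <= x in the sorted data.
    return lambda x: bisect.bisect_right(data, x) / n

def quantile(samples, q):
    """Empirical q-quantile (0 < q <= 1) by rank."""
    data = sorted(samples)
    return data[max(0, int(q * len(data)) - 1)]
```

With enough samples, `quantile(latencies, 0.999)` gives the 99.9% latency quantile discussed above, subject to the tail sampling error mentioned.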

The following figures show the empirical CDFs for the get, set and proxy latency for a couple of machines, one in a cluster in US-East-1 and one in EU-West-1. Colours distinguish CDFs for three different 3-day time periods (one each immediately before and after the infrastructure migration and one earlier on, to show the “normal” situation for the US-East-1 region), and line thicknesses distinguish between get, set and proxy values. Latencies are measured in microseconds and are plotted on a logarithmic scale. (The scale is cut off at the right-hand end, excluding the very largest values, since this makes it easier to see what’s going on.)


These two plots are pretty indicative of what’s seen for all of our machines. There’s a split between US-East-1 and the other regions, with latencies in the other regions being consistently lower after the migration than before: in the right-hand plot above, each of the red (after migration) “get”, “set” and “proxy” curves is to the left (i.e. at lower latency values) of the corresponding blue (before migration) curve. The higher quantiles for these regions are consistently lower after the migration than before.

US-East-1 is a little different. Overall, performance as measured by latency values is comparable after the migration to what we saw in February, but the CDFs from after the migration have an extra kink in them that we can’t explain, as seen in the red curves on the left-hand plot above. We also have this period through most of February where we had very low latency values, almost certainly down to some lucky eventuality of placement of the new monitor machine that we launched. It’s interesting to see that the lowest latency values after the migration match up well with the lowest values seen during the “good” period, but the larger values correspond more closely with what we saw in February. It’s our belief that this is an artefact of the underlying AWS networking infrastructure. We’ll continue to keep an eye on this and might try starting a new monitoring instance to see if that leads to better results (here, “better” mostly means “more understandable”).

The get and set latency values shown in the CDFs above include both the time taken for a memcached protocol message to get to and from the proxy server that the monitor connects to and the time taken for communication between the proxy server and the involved cache backend server. Measuring intra-cluster latencies, i.e. that communication time between proxy and cache backend, is complicated by the independent sampling of set, get and proxy latency at each sample time. (There is obviously some correlation between the different quantities, since the long-term variability seen above in the average latency values appears to occur synchronously in each quantity, but this is swamped by stochastic variability at short time-scales.)

One thing we can do though is to look at differences in the CDFs. We basically just take the difference between (for example) the get latency and the proxy latency for each quantile value and call this the “intra-cluster get latency CDF”. The figure below shows these “difference CDFs” for get and set latency for the same two machines shown in the CDF composite plot above.


The EU-West-1 results are comparable to what we see for all machines outside US-East-1: intra-cluster get and set latencies are reduced after the migration compared to before. The situation in US-East-1 is more complicated, but the plot here is representative of what we see: intra-cluster get latencies are better or the same as the “good” period, while intra-cluster set latencies are less good, but comparable to what we saw in February. This is slightly different to what we saw in the composite CDF plots, where get, set and proxy latencies were all worse after the migration than during the “good” period. We speculate that this is just down to differences in the route taken by memcached requests from the monitor machine to the proxies in US-East-1 now that the proxies are all on machines in a VPC (as opposed to being on EC2-Classic instances). We’re still working on understanding exactly what’s going on here, and are continuing to monitor the latency behaviour in US-East-1.
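The quantile differencing used for those plots can be sketched as follows (an illustrative sketch: real analysis should account for the independent sampling caveat noted above):

```python
def quantile_difference(a_samples, b_samples, qs):
    """For each quantile level q in qs, subtract b's empirical q-quantile
    from a's, e.g. get latency minus proxy latency as a rough estimate
    of intra-cluster latency at that quantile."""
    a = sorted(a_samples)
    b = sorted(b_samples)

    def at(data, q):
        # Rank-based quantile read-off from a sorted sample.
        return data[min(len(data) - 1, int(q * len(data)))]

    return [at(a, q) - at(b, q) for q in qs]
```

Plotting these per-quantile differences against the quantile levels gives the "difference CDFs" shown above.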


The kind of latency measurements that we use to monitor MemCachier performance are very susceptible to small variations in network infrastructure. This makes them a sensitive measuring tool (indeed, we use rules based on unexpected latency excursions for alerting to catch and fix networking problems early), but it can make them difficult to understand.

It seems that the most significant factor in latency behaviour really is “luck” related to infrastructure placement. This can be seen both from the existence of this “good” period in US-East-1 for which we have no other reasonable explanation, and from the long-term secular changes in latencies that we presume must be associated with changes within the AWS infrastructure (either networking changes, or the addition or removal of instances on the physical machines that our instances are hosted on). This unpredictability is the price you pay for using an IaaS service like AWS.

From the perspective of the infrastructure migration, we are quite satisfied with the performance after migration. All regions except for US-East-1 are clearly better than before, and while the results in US-East-1 are a little confusing, performance is better than it was in February, and from some perspectives is comparable even to the “lucky good period” that we had.