How we live-migrated a 150-node cache

In April, we announced our plan to migrate all of our Amazon Web Services (AWS) clusters to new VPC-based infrastructure on Elastic Compute Cloud (EC2). Having wrapped that up this past month, we've documented the process of migrating a large, multi-tenant, distributed cache service online, along with some lessons learned.

But why Virtual Private Cloud?

Amazon’s VPCs have been the de facto infrastructure for new EC2 instances for a while. Many newer EC2 instance types can only be launched inside a VPC, and VPCs enable important forward-looking features like IPv6 connectivity. The aim of the migration was to move all of our clusters to new VPC-based EC2 instances. We also wanted to rationalize the instance types we use, switching to newer and more performant ones, and to reorganize all clusters to use VPCs with fine-grained security group rules.

“Trust the Process”

To explain what we did during the migration, it helps to know how MemCachier uses AWS. User caches are provisioned into a MemCachier cluster composed of several EC2 instances, each running one or more cache backend servers and a proxy server. Clients connect to a proxy server, which manages forwarding memcached protocol requests to the backends. The number of cache backends an instance runs depends on the instance type. Clusters vary in size from one or two instances (for development clusters) to 12 instances. We have clusters in each AWS region where we have customers (six regions at the time of writing).

Across our clusters, we have about 70 EC2 instances of different sizes, running about 150 cache backend servers. Within each cluster, user caches are sharded across all backend servers, so restarting a backend server evicts the fraction of each user cache managed by that backend. This means a large-scale migration like this has to be done gradually: most user applications can deal with gradual eviction of cache items far more easily than with a single “big bang” flush of the entire cache.
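
To make the “fraction of each user cache” point concrete, here is a minimal sketch of hash-based sharding. The backend names and hashing scheme are purely illustrative, not MemCachier’s actual sharding code.

```python
import hashlib

# Hypothetical backend list and hashing scheme, for illustration only.
BACKENDS = ["backend-0", "backend-1", "backend-2", "backend-3"]

def backend_for_key(cache_id: str, key: str) -> str:
    """Pick the backend responsible for a given key of a given user cache."""
    digest = hashlib.sha256(f"{cache_id}:{key}".encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

# Restarting "backend-2" evicts only the keys that map to it, i.e. roughly
# 1/len(BACKENDS) of each user's cache, rather than flushing everything.
print(backend_for_key("cache-42", "session:abc123"))
```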

Automation and testing

Migrating all instances across all supported AWS regions required more than 300 steps, where a step might be “launch and configure a new instance”, “start a new backend server”, or similar, and each step involved several sub-steps. Doing this manually would have been far too error-prone, so automating the process was key to performing the steps in the right order, error-free.

Luckily, we had already automated many of the individual steps involved (e.g. launching new machines, migrating DNS records, adding/removing nodes from a cluster), but we built a new state machine-based migration tool to sequence everything and to combine individual system administration actions into larger logical steps. We tested this tool extensively on our staging environments, then used it to migrate development clusters first, to ensure that it worked correctly, before moving on to production clusters.
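
To give a sense of what “state machine-based” means here, the sketch below shows the shape of such a sequencer. The `Step` class, `run_migration` and the step names are invented for illustration; the real tool also handles errors, persists its state, and runs many clusters in parallel.

```python
from typing import Callable, Dict, Optional

# Toy state-machine migration runner. Each state names an action and the state
# that follows it, so a run can in principle be resumed from its last state.
class Step:
    def __init__(self, name: str, action: Callable[[], None], next_state: Optional[str]):
        self.name = name
        self.action = action
        self.next_state = next_state

def run_migration(steps: Dict[str, Step], state: Optional[str]) -> None:
    while state is not None:
        step = steps[state]
        print(f"running: {step.name}")
        step.action()             # e.g. launch an instance, start a backend, update DNS
        state = step.next_state   # in practice, persist this so a crashed run can resume

steps = {
    "launch":        Step("launch and configure a new instance", lambda: None, "start_backend"),
    "start_backend": Step("start a new backend server",          lambda: None, "update_dns"),
    "update_dns":    Step("migrate DNS records",                 lambda: None, None),
}
run_migration(steps, "launch")
```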

The result was that we had no problems during the migration that were due to human error, mis-sequencing of migration steps, or missing sub-steps within the migration plan. This was particularly important because we migrated many clusters in parallel, which made managing the overall state of the migration even more involved.

To resolve, or not to resolve?

From what we can tell, there were very few problems caused by the migration: some customers noticed and asked about the partial cache flushes that resulted from backends migrating to new machines, but there appeared to be very little disruption.

However, not everything in the garden was rosy: we had one quite serious problem during the migration. One of our larger customers noticed that they were getting inconsistent values from one of their caches, something that appeared only after we began migrating the cluster their cache is hosted on. We tracked this down to an undocumented behavior of DNS name resolution on EC2 instances (later confirmed by Amazon).

Each EC2 instance within a VPC has both a private IP address, accessible only within the VPC, and a public IP address. Intra-security group rules (i.e. “allow all traffic between instances within this security group”) work by matching the private IP address subnet, so servers within a MemCachier cluster use private IP addresses to talk to one another. An instance’s public DNS record resolves to its private IP address within a VPC since EC2 provides a DNS resolver specific to the VPC. This means that one instance within a VPC can connect to another using its DNS name and intra-security group rules will apply correctly!
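
For reference, the kind of intra-security group rule described above looks roughly like the following boto3 call. The region and security group ID are placeholders, not our real configuration.

```python
import boto3

# Sketch of an "allow all traffic between members of this group" rule.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",   # the cluster's security group (placeholder)
    IpPermissions=[{
        "IpProtocol": "-1",           # all protocols and ports
        # Source is the same security group, so any member can talk to any other
        # member -- but only over its private, in-VPC address.
        "UserIdGroupPairs": [{"GroupId": "sg-0123456789abcdef0"}],
    }],
)
```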

The problem we discovered was that the mapping from a new instance’s public DNS name to its private IP address can take several minutes to propagate after the instance is launched. During that window, the new instance’s DNS name resolves to its public IP address from other instances in the VPC. Connecting to the new instance via its public IP address doesn’t work, of course, because our intra-security group rules restrict access to the VPC’s private IP subnet.
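
The symptom is easy to check for. Here is an illustrative helper (not our production code) that resolves a hostname and reports whether the answer is a private (RFC 1918) address; run from inside the VPC shortly after launch, it can still see the public IP.

```python
import ipaddress
import socket

def resolves_to_private(hostname: str) -> bool:
    """Return True if the hostname currently resolves to a private address."""
    addr = socket.gethostbyname(hostname)
    return ipaddress.ip_address(addr).is_private

# e.g. resolves_to_private("ec2-203-0-113-10.compute-1.amazonaws.com")  # placeholder name
```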

As a result, when cache backend servers in the cluster were notified of a new instance and tried to connect to its proxy server, the new instance’s DNS name sometimes resolved to the public IP address and the connection failed. This meant that different proxy processes within the cluster ended up being connected to different numbers of cache backends, which led to inconsistent views of some caches.

Once we had identified the problem, avoiding it for the rest of the migration was easy, since we have a mechanism to force all backends in a cluster to refresh their proxy connections. It was, however, a serious problem for this one customer. We have since worked around the issue in our code, so we won’t be hit by it again.
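
The shape of that workaround is roughly the guard sketched below (the details differ in our real code): don’t tell the rest of the cluster about a freshly launched instance until its DNS name resolves to a private VPC address.

```python
import ipaddress
import socket
import time

def wait_for_private_resolution(hostname: str, timeout: float = 300.0) -> str:
    """Block until hostname resolves to a private address, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            addr = socket.gethostbyname(hostname)
            if ipaddress.ip_address(addr).is_private:
                return addr
        except socket.gaierror:
            pass  # name not resolvable yet
        time.sleep(5)
    raise TimeoutError(f"{hostname} did not resolve to a private address in time")
```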

Lessons learned

There were three main lessons we drew from all this:

  1. Scheduling: We did the migration over the space of a couple of weeks that spanned the end of the month, which was a mistake. Some customers had end-of-month processes to run and it would have been better to avoid the few days either side of the end of the month to ensure that there was no migration-related performance degradation for them.
  2. Announcements: We announced the migration in a blog post and on Twitter, and we contacted all of our larger customers by email, but those emails could have been more explicit about asking customers to let us know about any special jobs they needed to run during the migration period.
  3. Extra cluster diagnostics: The biggest problem we had (the DNS name resolution issue) could have been detected and alerted on with better cluster diagnostics (a sketch of the kind of consistency check we have in mind follows this list). We’ve had an issue open for a while to add some of these checks, and it will definitely be done before we undertake any more large infrastructure changes.
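
As promised in point 3, here is a sketch of the kind of consistency check we mean: every proxy in a cluster should see the same set of cache backends. How the connection data is collected from each proxy is left out, and the names are illustrative.

```python
from typing import Iterable, List, Mapping

def check_backend_consistency(proxy_backends: Mapping[str, Iterable[str]]) -> List[str]:
    """Return warnings for proxies whose connected-backend set differs."""
    expected: set = set()
    warnings: List[str] = []
    for i, (proxy, backends) in enumerate(proxy_backends.items()):
        backends = set(backends)
        if i == 0:
            expected = backends
        elif backends != expected:
            warnings.append(f"{proxy} is connected to {sorted(backends)}, expected {sorted(expected)}")
    return warnings

# Example: the second proxy is missing backend-2 -- exactly the kind of
# inconsistency the DNS issue produced.
print(check_backend_consistency({
    "proxy-a": ["backend-0", "backend-1", "backend-2"],
    "proxy-b": ["backend-0", "backend-1"],
}))
```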