New Security Features

We have recently added some new security features to MemCachier. In brief, you can now have multiple sets of credentials for a cache, rotate credentials, and restrict certain capabilities on a per-credential basis. These features are all controlled from the Credentials panel of the MemCachier dashboard for your cache:

[Screenshot: the Credentials panel]

The features described in this article are also explained in the MemCachier documentation.

Credentials

Caches may now have multiple sets of username/password credentials. The credentials for a cache are listed in the table in the Credentials panel on your cache’s analytics dashboard. New credentials for a cache are created using the Add new credentials button, and credentials may be deleted using the X button next to the credentials row.

One set of credentials is distinguished as primary, while the rest are secondary credentials. For caches provided through our add-on with a third-party provider (e.g., Heroku or AppHarbor), the primary credentials are linked to the configuration variables stored with the provider. That is, these are the credentials you see when you look at your application’s environment.  Secondary credentials can be promoted to primary using the up-arrow button next to the credentials row. When secondary credentials are promoted to primary, the username and password for the credentials being promoted are pushed to the provider’s configuration variables.

In practice, this means that you can rotate the credentials for a MemCachier cache associated with a Heroku app by creating a new set of credentials and promoting them to primary. This will push the new username and password to the MEMCACHIER_USERNAME and MEMCACHIER_PASSWORD configuration variables in your Heroku app, and trigger a restart so your application will pick up the new MemCachier credentials.
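For example, a Python app on Heroku typically reads these values from its environment at startup, so the rotation-triggered restart is all it takes to pick up the newly promoted credentials. A minimal sketch (MEMCACHIER_SERVERS is the standard companion config variable that holds the cache endpoint):

import os

# Heroku restarts the app after a promotion to primary, so these
# reflect the newly promoted credentials.
servers = os.environ["MEMCACHIER_SERVERS"]
username = os.environ["MEMCACHIER_USERNAME"]
password = os.environ["MEMCACHIER_PASSWORD"]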

This ability to have both a primary and many secondary credentials lets you rotate your credentials with zero downtime!

Capabilities

We’ve further enhanced multiple credentials by adding a security extension to our memcache implementation: capabilities! Capabilities control the operations available to a client once it has authenticated with a given set of credentials.

Each set of credentials for a cache has associated write and flush capabilities. By default, new credentials have both capabilities, meaning that the credentials can be used to update cache entries and flush the cache via the memcached API.

You can restrict a client to read-only access by switching off the write capability for its credentials, and prevent a client from using the memcached API to flush your cache by switching off the flush capability.

Per-credential capabilities are managed using the checkboxes on each credential row in the Credentials panel: if the checkbox is checked, the credential has the capability; if not, the capability is disabled for that credential.
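As an illustration, a worker that should never modify the cache could connect with credentials whose write capability is switched off. A minimal sketch using the python-binary-memcached client (the server address and credentials are placeholders, and the exact failure mode for a refused write depends on your client library):

import bmemcached  # python-binary-memcached; supports SASL authentication

# Placeholder endpoint and a credential set created in the Credentials
# panel with the write capability switched off.
cache = bmemcached.Client(
    ["35865.1e4cfd.us-east-3.ec2.prod.memcachier.com:11211"],
    username="read-only-user",
    password="read-only-password",
)

value = cache.get("greeting")       # reads work as usual
cache.set("greeting", "hi there")   # refused by the server: no write capability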

Dashboard SSO rotation

The MemCachier dashboard for your cache is accessed via a persistent unique URL containing a cache-specific single-sign-on hash. For security, you may wish to rotate this cache-specific hash to generate a new unique URL for the cache dashboard. The Rotate SSO secret button in the Credentials panel of the dashboard does this hash rotation. Pressing this button generates a new hash and redirects the dashboard to the resulting new unique URL.

Security @ MemCachier

We take security seriously and hope these new features help you improve your own security practices. As always, for a guide to MemCachier’s internal security practices, see our documentation.

Flush Command Logging for Heroku

We are happy to announce a new feature for our Heroku customers. In the past, we have had several requests from customers who wanted to know why their caches had been flushed. To help you trace how a flush command came about, we now push a log message to the Heroku log whenever a cache is flushed.

The log message contains the hostname of the proxy the flush command was executed on as well as its origin. The origin can be one of the following:

  • memcache client: The flush command originates from a normal memcached client.
  • web interface: This happens when a client clicks “Flush Cache” on the analytics dashboard.
  • admin operation: MemCachier flushed the cache on your behalf. This can occur when you perform operations such as switching from one cluster to another.

We hope you find it useful! We’ll be looking to add more information to the Heroku log in the future and would love suggestions and feedback on this.

ASCII Protocol Support

We’re happy to announce the general availability of a new feature on MemCachier, the ability to connect to our servers using the memcache ASCII protocol! Our official documentation covers it here.

Previously, this wasn’t supported as the ASCII protocol provides no standardized way to do authentication. We added our own simple scheme. When you establish a connection, the first command you must send to us is a set with your username as the key and your password as the value. For example:

$ telnet 35865.1e4cfd.us-east-3.ec2.prod.memcachier.com 11211

> set 15F38e 0 0 32
> 52353F9F1C4017CC16FD348B982ED47D
STORED

The STORED line indicates successful authentication. After this, you can issue all the familiar memcache ASCII protocol commands! You will need to send the set command fairly quickly though, as otherwise we time out the connection.
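To make the handshake concrete, here is a minimal sketch in Python that performs the same authentication over a raw socket, using the placeholder credentials from the telnet example above:

import socket

# Placeholder endpoint and credentials; use the values from your dashboard.
HOST = "35865.1e4cfd.us-east-3.ec2.prod.memcachier.com"
PORT = 11211
USERNAME = "15F38e"
PASSWORD = "52353F9F1C4017CC16FD348B982ED47D"

def ascii_auth(sock, username, password):
    # MemCachier's scheme: a set with the username as key and the
    # password as value must be the first command on the connection.
    value = password.encode()
    sock.sendall(b"set %s 0 0 %d\r\n%s\r\n" % (username.encode(), len(value), value))
    if not sock.recv(1024).startswith(b"STORED"):
        raise RuntimeError("authentication failed")

sock = socket.create_connection((HOST, PORT), timeout=5)
ascii_auth(sock, USERNAME, PASSWORD)

# From here on, the usual ASCII commands work, e.g.:
sock.sendall(b"get greeting\r\n")
print(sock.recv(1024))  # b"END\r\n" if the key is absent
sock.close()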

One thing to be careful of: only run this over the internal network of the IaaS provider hosting your MemCachier cache (e.g., AWS EC2). Why? Those networks are secure against eavesdropping, so the unencrypted connection works fine. If you are connecting over the public Internet, you’ll want to use our TLS support to secure the connection.
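If you do need to cross the public Internet, the same handshake can run over a TLS connection. A sketch reusing the ascii_auth helper from above (the TLS port here is an assumption for illustration; check our documentation for the actual value):

import socket
import ssl

TLS_PORT = 11212  # assumed port for illustration; see the documentation

context = ssl.create_default_context()
raw = socket.create_connection((HOST, TLS_PORT), timeout=5)
sock = context.wrap_socket(raw, server_hostname=HOST)

# Authenticate exactly as before, now over an encrypted channel.
ascii_auth(sock, USERNAME, PASSWORD)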

Beyond running ad-hoc queries yourself using telnet or similar, supporting the ASCII protocol also expands the range of clients and frameworks you can use. Just make sure to modify them to send the set command for authentication on first connection. As an example, we created a Django caching backend that uses the ASCII protocol and works with MemCachier!

New Analytics Dashboard and Cache Migration

We are happy to announce a new design for the analytics dashboard and the ability to move your cache from one cluster to another in emergencies.

[Screenshot: the redesigned analytics dashboard]

In August, we had the biggest incident in MemCachier history which affected two clusters on Amazon in the US-East-1 region simultaneously. During our post-mortem analysis we realized that, in extreme cases like these, a significant number of customers would prefer to move to a different cluster even if it means losing the contents of their cache.

As part of a wide range of improvements to better handle such extreme outages, we’ve built out, and decided to expose to customers, the ability to move a cache to a different cluster.

The feature can be found in the redesigned analytics dashboard and is available to all paying customers. When used, the cache will be moved to the least loaded cluster in the same region and data in the old cluster will be flushed. The transition is seamless, meaning the configuration of the memcache client does not change. However, the DNS record change might take up to 3 minutes to propagate. It is important to note that this feature is meant to be used only in emergencies! Please always refer to http://status.memcachier.com to see if the cluster you are on is experiencing problems.
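If you want to watch the change propagate after triggering a move, you can resolve the cache endpoint yourself; the returned addresses switch over once the new record is live (the hostname below is a placeholder):

import socket

# The endpoint from your MEMCACHIER_SERVERS config variable (placeholder).
host = "35865.1e4cfd.us-east-3.ec2.prod.memcachier.com"

# After a migration, these addresses change once the DNS update
# propagates (up to ~3 minutes).
name, aliases, addresses = socket.gethostbyname_ex(host)
print(addresses)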

We are also changing how we create DNS records for customers as part of this change. While you used to be able to tell which cluster you were on from the DNS records, the analytics dashboard now displays this information.

Besides serving as a last resort during downtime, this feature is also useful for creating a second cache and ensuring your two caches are on independent clusters. If the two caches happen to land on the same cluster, simply move one!

Rest assured that we’ve improved and continue to improve our ability to handle extreme outages without your involvement or loss of any cached data. But, it’s always nice to have a last resort.

Trouble in US-East – Postmortem

What happened?

Around 3pm PST on Monday, we started seeing very high connection counts to two proxy servers in one of the clusters in US-East. Normally, this correlates with either a poor network connection that causes lots of connection timeouts and re-connections from clients, or a handful of clients making an abnormal number of connections. In those cases we can simply replace an affected machine or reroute traffic to balance the load out.

Unfortunately, we were not able to attribute the drastic rise in connection count to either poor performance of a machine or heavy load from any particular subset of clients. We did see abnormal network behavior: most latency between machines fell in the usual ~0.3ms range, but around 10-20% of requests were in the 5-15ms range. This, however, was the case across all machines.

Sporadically, the high number of connections moved from one proxy to another in the cluster. We replaced several of the proxy machines and managed to restore stable operation to that cluster within about 30 minutes.

However, shortly afterwards, two different clusters began exhibiting similar behaviors. These clusters are isolated: they don’t share any resources and are in different availability zones. Similarly, there were no obvious signs of issues other than some minor increases in network latency. Network bandwidth, CPU load, etc., were all normal, and we had not deployed new software to these servers.

We eventually replaced nearly all of the proxy machines and some of the backend machines in US-East. We also increased the total number of proxies in each cluster.

In all, the incidents lasted until around 11pm PST on Wednesday. While service for most customers returned to normal within an hour, many of you were affected sporadically or continuously throughout the entire incident.

What caused this?

We believe there were two issues. First, the network started behaving abnormally, with latency for a small subset of packets going from 0.3ms to 5-15ms and a portion of packets being dropped. This was occurring across the board, not just on a few machines. Second, several of the ways in which our system responds to failure actually amplified this particularly rare issue. For example, we log connection errors between servers. During a widespread outage like the one that occurred, this means millions of log entries. Worse, we log the struct that identifies a failed server, which includes a list of outstanding requests to that server. Similarly, we continuously record the connection count for each server using the lsof command, which isn’t particularly efficient. Typically this isn’t a problem, but when connection counts are unusually high, it doesn’t help an already overloaded CPU trying frantically to free unused file descriptors.

We also made human errors that hurt a subset of clients. In addition to launching new servers and migrating load, we added IP filtering rules to help mitigate the load temporarily. This helped, but meant that new connections from some legitimate clients were blocked.

What’s high availability?

Many of you reached out to support for clarification on what kind of availability our high-availability plans provide. High availability today means you have access to more than one proxy in the same cluster. Should one fail, the other is available. Importantly, each proxy has the same view of the underlying cache, so a transient network failure or an individual server crash can be masked by your client library without losing access to any portion of your cache. This works well in the vast majority of cases, where individual servers fail or experience high network latency. However, it doesn’t help in cases like this one, where entire clusters were affected.

Moving forward, we’re investigating whether and how to offer plans that span multiple clusters. If you have suggestions, please email them to support@memcachier.com.

What’s next?

First, we are completely revising how we do incident response. We are establishing codified procedures to step through for all incident types, including cluster-wide failures like the one this week.

Second, we are fixing the issues in our systems that amplified the underlying problem and reviewing other technical ways to prevent issues from propagating through the cluster.

Third, we’ve normally skewed towards avoiding cache loss while resolving network issues, by re-balancing load and replacing individual machines in a cluster. While we will still do this for most incidents, we are putting in place a much more aggressive course of action for when issues actually start affecting customers. In particular, starting today, we will skew towards resolving connectivity issues as fast as possible, at the expense of flushing parts of the cache.

We’ve already deployed changes to internal tools that make this process quicker and more automated, and we’re working on additional changes to our monitoring infrastructure that will help identify cases like this one that require more drastic measures.

We are also improving our tooling around communication. This was the worst outage we’ve ever had, and it resulted in a breakdown of how we communicate with customers. We’re going to integrate our status page, Twitter account, and monitoring systems so that we can inform our customers of ongoing issues much more quickly and seamlessly.

We’re Sorry

We are sorry for the outage and the impact it had on our customers. This is the worst one we’ve had in almost 5 years of operation. We’d appreciate your feedback, even when it’s negative. Our doors are always open at support@memcachier.com.