AWS Ohio region simplifies cross region failover

Amazon Web Services just announced the availability of the Ohio region, us-east-2. This is the second region in the eastern United States and the first new region to come up near the original and largest region, Northern Virginia. This announcement prompted me to think about one of the early stories of how AWS was enabling people to do things with technology that were previously impossible.

The technology team on the 2012 Obama presidential campaign spoke at re:Invent 2012 and told a story about their preparations for Hurricane Sandy. In the aftermath of the storm, while many other companies were physically moving hardware into new datacenters and struggling to stay online, the Obama campaign was unaffected. They had moved their entire technology stack across the world, 27 terabytes of data in total, in 4 hours. Although AWS ultimately remained online throughout the storm, the campaign had peace of mind that its operations would not be affected.

That story was inspirational and transformative in 2012, but it is now 2016 and cloud technology has become less a novelty and more the default choice for both old and new companies. However, it remains difficult to support multi-region failover. The Ohio region changes this and opens up the possibility of easy multi-region failover in the eastern US.

AWS promotes the concepts of regions and Availability Zones (AZs). Regions are separated from one another by vast distances, and each region is made up of multiple AZs. Regions are connected to each other by WAN technology, either over the internet or over Amazon-owned long-haul fiber links. The latency between regions runs from tens to hundreds of milliseconds.

Availability Zones, on the other hand, are separated by tens to at most around a hundred miles. They have separate connectivity and utility service and are never on the same geologic fault lines (if applicable). Because they are so much closer together, the latency between them is typically less than a single millisecond. An individual AZ may fail or go offline, but because of the isolation between them, failures should not be correlated. A power loss in one AZ will not affect another. A backhoe cutting one AZ's fiber line will not affect connectivity in another AZ.

Critically, Availability Zones are close enough to each other that you can synchronously replicate between them in your application while still treating each one as a separate fault boundary.

Although the isolation an AZ provides protects you from many types of failures, there are still rare instances where a whole region becomes temporarily unstable. These problems tend to stem from service disruptions caused by software bugs rather than from correlated infrastructure failures. In the early days of AWS, Elastic Load Balancers (ELBs) had several cross-AZ outages. More recently, DynamoDB had issues in Northern Virginia that affected several other services.

On the other hand, service disruptions across regions are so rare as to be unheard of. Having the ability to fail over to another region can provide an important safety net to companies that absolutely cannot afford to be offline.

Unfortunately, there are several hurdles to overcome in order to support cross-region failover. In most cases, these hurdles are either insurmountable or carry tradeoffs that lead companies never to fail over, even when they have the ability to do so.

Traditional applications hosted in datacenters typically use relational databases that support transactions and guarantee data integrity. To synchronize these databases across regions, database administrators typically set up asynchronous replication between multiple datacenters. Because of the latency between regions, the backup databases may run anywhere from several hundred milliseconds to full seconds behind the main databases.

[Image: Hurricane Sandy flooding in New York City]

In the event of a crisis when a failover may be helpful, the operations team has to determine how far behind the data replication is in the secondary environment.  They have to determine whether the backup databases are in a valid state.  They need to make an assessment as to what data they will lose if they fail over.  They have to decide whether to wait out the outage or to risk the failover in these circumstances.
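The questions the operations team has to answer can be written down as a simple decision rule. The Python sketch below is only an illustration of that reasoning: the names ReplicaStatus and should_fail_over, the idea of a single "tolerable loss" threshold, and every number passed in are assumptions made for this example, not a procedure taken from any particular team or tool.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    """Snapshot of the secondary database (hypothetical fields)."""
    lag_seconds: float    # how far the replica trails the primary
    is_consistent: bool   # whether the replica passed validity checks


def should_fail_over(status: ReplicaStatus,
                     max_tolerable_loss_seconds: float,
                     expected_outage_seconds: float,
                     failover_duration_seconds: float) -> bool:
    """Return True when failing over looks better than waiting it out."""
    # A replica in an invalid state cannot take traffic, whatever its lag.
    if not status.is_consistent:
        return False
    # Writes made inside the lag window would be lost on failover.
    if status.lag_seconds > max_tolerable_loss_seconds:
        return False
    # Failing over only helps if the outage outlasts the failover itself.
    return expected_outage_seconds > failover_duration_seconds
```

With a replica that is half a second behind, a one-second loss budget, an outage expected to last an hour, and a five-minute failover, the rule says to fail over. Change the lag to five seconds and the same rule says to wait, which is the position most teams find themselves in with asynchronous cross-region replication.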

Most companies choose not to fail over under these circumstances. It is often safer to wait for the environment to stabilize or for the technical team to fix the problem. These companies pay the entire cost of setting up a replicated environment without taking advantage of it when it matters. The cost of failing over is simply too high.

On the other hand, if they do fail over, these companies have to deal with the costs of lost transactions: the staff time needed to sort out the mess, the engineering time spent on bugs caused by missing data, and the loss of customer trust that comes with any data loss, even a temporary one.

More modern applications try to deal with these problems by using technologies that can withstand the latencies of a cross-region infrastructure. They typically use NoSQL databases that support eventual consistency rather than transactional guarantees. Many replicate customer data across regions and have sophisticated tools for handling potential data conflicts.

Netflix has written extensively on their multi-region architecture designed around the distributed Cassandra NoSQL data store. They have the ability to transfer traffic from region to region. These types of architectures are very complex to create, and exceedingly difficult to get right. World-class engineering organizations like the Netflixes of the world can invest in these types of architectures, but they are out of reach of smaller enterprises or companies wishing to run legacy applications.

This is where the Ohio region starts to change things. Ohio is far enough away from Virginia to protect against natural disasters like the Hurricane Sandy event the Obama campaign was trying to avoid. It is also close enough that the latency between the regions is only 12ms.

For some customers, it may be possible to synchronously replicate to the new Ohio region. Databases are designed to handle typical hard drive seek latencies of around 9ms, so an extra 12ms on each write is within the range they already tolerate. Many or even most applications are read heavy and can absorb slightly higher write latencies, and others can hide the extra delay with common latency-hiding techniques.
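A little arithmetic shows why a read-heavy application may barely notice the added write latency. The Python sketch below uses a deliberately pessimistic model that is an assumption of this example: a synchronous write pays for the local commit, one inter-region round trip, and the remote commit, one after the other. It also assumes the 12ms figure is a round trip, that reads and commits each cost the 9ms seek time, and that 90% of requests are reads. The function names are hypothetical.

```python
def sync_write_latency_ms(local_commit_ms: float,
                          inter_region_rtt_ms: float,
                          remote_commit_ms: float) -> float:
    """Serial model: local commit, then round trip, then remote commit."""
    return local_commit_ms + inter_region_rtt_ms + remote_commit_ms


def mean_request_latency_ms(read_fraction: float,
                            read_ms: float,
                            write_ms: float) -> float:
    """Average latency across a mix of reads and writes."""
    return read_fraction * read_ms + (1.0 - read_fraction) * write_ms


# Assumed figures: 9 ms per disk operation, 12 ms between regions.
write_ms = sync_write_latency_ms(9.0, 12.0, 9.0)        # 30 ms per write
mean_ms = mean_request_latency_ms(0.9, 9.0, write_ms)   # about 11.1 ms
print(f"write: {write_ms:.1f} ms, mean request: {mean_ms:.1f} ms")
```

Under these assumptions an individual write roughly triples in latency, but the average request only grows from 9ms to about 11ms. A write-heavy workload would see a much larger increase, which is why this will only suit some customers.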

At these low latencies, cross region replication and failover can be supported with architectures similar to those used in Availability Zone failover.  It seems like this is AWS’s thinking as well, as they are charging for data transfer between us-east-1 and us-east-2 at the highly discounted cross Availability Zone rate.

With caching and latency-hiding techniques, many applications could even use Route 53’s latency-based routing to serve traffic out of both regions, backed by a primary datastore in one region that replicates synchronously to the other.
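To make the routing side concrete, here is a minimal Python sketch that builds the change batch for two latency-based A records, one per region. The domain name and IP addresses are placeholders from the documentation ranges, and the helper names are hypothetical. The dictionary follows the shape that the boto3 Route 53 client accepts in change_resource_record_sets; the sketch only builds the data and does not call AWS.

```python
def latency_record(name: str, region: str, ip_address: str, ttl: int = 60) -> dict:
    """One latency-based A record pointing at a single region's endpoint."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            # SetIdentifier must be unique among records sharing name and type.
            "SetIdentifier": f"{name}-{region}",
            # Region is what makes Route 53 treat this as latency-based routing.
            "Region": region,
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip_address}],
        },
    }


def build_change_batch(name: str, endpoints: dict) -> dict:
    """Build a change batch with one latency record per region endpoint."""
    return {
        "Comment": "Latency-based routing across two regions (example)",
        "Changes": [latency_record(name, region, ip)
                    for region, ip in endpoints.items()],
    }


# Placeholder endpoints; real values would be your own load balancers or hosts.
batch = build_change_batch(
    "app.example.com.",
    {"us-east-1": "192.0.2.10", "us-east-2": "198.51.100.10"},
)
```

In a real deployment this batch would be passed as the ChangeBatch argument to boto3.client("route53").change_resource_record_sets, together with the HostedZoneId of your own zone. Pairing each record with a health check would let Route 53 stop sending traffic to a region that has gone unhealthy.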

That kind of architecture can provide very high levels of reliability and can provide valuable protection against regional service outages and large natural disasters.

New regions like us-east-2 blur the distinction between the Availability Zone and the region. The low latency between the two eastern US regions can allow companies to protect against regional outages while still supporting low-cost synchronous replication architectures. Amazon seems to be encouraging this kind of thinking with discounted data transfer rates between the two regions.