Cloud computing providers incorporate the concept of failure domains throughout their services.
Zones and Regions are some of the most fundamental examples of failure domains. In general ‘Cloud’ terminology:
A Zone is a single logical datacenter with its own power, network, and cooling resources. The Zone’s power, network, and cooling resources are independent of any nearby or collaborating Zones. One or more buildings may provide the physical space for the Zone’s compute and storage devices; however, all of the buildings in the Zone are generally considered one failure domain because they share network/power/cooling resources and are subject to the same local physical risks such as weather and fire.
A Region is a set of collaborating Zones (datacenters) grouped together based on their geographical proximity. The Zones within a Region are connected with high-speed networking to facilitate low-latency communication between compute instances and reliable replication of data. These connections are made in such a way that no Zone becomes a point of failure for communication between other Zones. That is, Zone 1 and Zone 3 must be able to communicate even if Zone 2 is down.
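The "no Zone is a point of failure for other Zones" property can be checked mechanically on a model of the inter-Zone links. The following is a minimal sketch, not a description of how any provider wires its network: the zone names, the two example topologies, and the helper functions are all hypothetical.

```python
from collections import deque


def reachable(links, start, down):
    """Return the set of zones reachable from `start` when the zones in `down` have failed."""
    seen = {start}
    queue = deque([start])
    while queue:
        zone = queue.popleft()
        for neighbor in links.get(zone, ()):
            if neighbor not in down and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def survives_any_single_zone_failure(links):
    """True if all surviving zones can still reach each other after any one zone fails."""
    zones = set(links)
    for failed in zones:
        survivors = zones - {failed}
        if not survivors:
            continue
        start = next(iter(survivors))
        if reachable(links, start, {failed}) != survivors:
            return False
    return True


# Hypothetical topologies: zone names and links are illustrative only.
full_mesh = {
    "zone-1": ["zone-2", "zone-3"],
    "zone-2": ["zone-1", "zone-3"],
    "zone-3": ["zone-1", "zone-2"],
}
daisy_chain = {
    "zone-1": ["zone-2"],
    "zone-2": ["zone-1", "zone-3"],
    "zone-3": ["zone-2"],
}

print(survives_any_single_zone_failure(full_mesh))    # True
print(survives_any_single_zone_failure(daisy_chain))  # False
```

In the daisy-chain case, losing zone-2 cuts zone-1 off from zone-3, which is exactly the situation the Region design is meant to avoid.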
Applications should be deployed across multiple Zones in a Region as a foundation for achieving high availability. Most high availability deployment architectures start by deploying an application into 3 zones so that when one zone has a problem there are at least two healthy zones serving customer traffic and keeping data safe.
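To make the three-zone reasoning concrete, here is a small sketch that spreads replicas evenly across zones and reports how much capacity remains when one zone is lost. The zone names, replica count, and function names are assumptions for illustration; real schedulers and autoscaling groups have their own placement logic.

```python
from collections import Counter
from itertools import cycle


def spread_across_zones(replica_count, zones):
    """Assign replicas to zones round-robin so the zones stay evenly loaded."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]


def capacity_after_zone_loss(placement, failed_zone):
    """Fraction of replicas still running if `failed_zone` goes down."""
    remaining = sum(1 for zone in placement if zone != failed_zone)
    return remaining / len(placement)


zones = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names
placement = spread_across_zones(6, zones)

print(Counter(placement))
print(capacity_after_zone_loss(placement, "zone-b"))
```

With an even spread over three zones, a single zone failure leaves roughly two thirds of the replicas running, so the remaining two zones need enough headroom to carry the full load.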
Zone and Region Concept Across Cloud Providers
Let’s explore what Zones and Regions mean in each of the three largest Cloud providers: Amazon Web Services, Microsoft Azure, and Google Cloud Platform.
Amazon Web Services
AWS’ definition and implementation of these terms have shaped the larger industry usage, so let’s start with them.
In AWS, ‘Availability Zone (AZ)’ is the term for a single logical datacenter. The inclusion of ‘Availability’ in the term emphasizes several facts:
- compute and local storage are provisioned from within some physical structure
- while the resources available within a datacenter may be large, those resources are also finite, and may not always be available
A ‘Region’ comprises multiple, isolated Availability Zones that are in close geographic proximity and connected through low-latency links. Each AWS Region is designed to be completely isolated from the other AWS Regions to achieve the greatest possible fault tolerance and stability. The practical result of this is that while a critical service such as IAM, EC2, or S3 may have an incident in one region, e.g. N. Virginia (us-east-1), the incident does not spill over into other regions, e.g. Ohio (us-east-2) or Oregon (us-west-2).
AWS builds Availability Zones 10km to 100km apart. Latency between machines in different Availability Zones is generally less than 1ms, though you should probably model the maximum latency between AZs as 10ms when reasoning about things like connection timeouts.
The physical proximity of Availability Zones is very important. Cloud providers need to balance correlation of risk with the ability to communicate quickly and reliably between Availability Zones. That is, we need to balance the risk of a single hurricane flooding multiple Availability Zones (put them far apart) with the need to replicate data between Availability Zones reliably using practices such as synchronous write replication (put them close together). Watch Adrian Cockcroft’s talk on this subject for more details.
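One way to carry the 10ms planning figure into code is to derive connection timeouts from it rather than hard-coding unrelated numbers. This is only a sketch: the multiplier of 3 and the function names are assumptions, and the right headroom depends on your workload and retry policy.

```python
import socket

# 10ms planning figure taken from the text above; it is not a measured value.
ASSUMED_MAX_INTER_AZ_RTT_S = 0.010
# Hypothetical headroom: allow a few worst-case round trips before giving up.
RTT_MULTIPLIER = 3


def connect_timeout_seconds(max_rtt_s=ASSUMED_MAX_INTER_AZ_RTT_S,
                            multiplier=RTT_MULTIPLIER):
    """Derive a connect timeout from the assumed worst-case inter-AZ round trip."""
    return max_rtt_s * multiplier


def open_cross_az_connection(host, port):
    """Open a TCP connection to a peer in another AZ using the derived timeout."""
    return socket.create_connection((host, port), timeout=connect_timeout_seconds())


print(connect_timeout_seconds())
```

Keeping the assumed latency in one named constant makes it easy to revisit once you have measured latency between your own zones.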
Let’s see how Google implements these same concepts.
Google Cloud Platform
GCP’s definitions for regions and zones are quite similar to AWS’.
Regions are independent geographic areas that consist of Zones.
Each Zone is a deployment area for compute resources such as virtual machines. Each zone should be considered a single failure domain within a region.
GCP provides precise guidance on network latency, stating that “Locations (Zones) within regions tend to have round-trip network latencies of under <1ms on the 95th percentile.”
An important difference between GCP and other Cloud providers is that a single virtual network (VPC) can span Regions in GCP. Think about what using this feature will do to your system’s fault isolation and IP management practices before using it.
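On the IP management side, one common precaution is to carve a non-overlapping address block per Region out of the network's overall range before creating any subnets. The sketch below uses Python's standard `ipaddress` module; the 10.0.0.0/8 range, the /16 block size, and the choice of regions are assumptions for illustration, not GCP recommendations.

```python
import ipaddress
from itertools import combinations

# Hypothetical overall range for the multi-region network.
vpc_range = ipaddress.ip_network("10.0.0.0/8")
# Example regions; substitute the ones you actually deploy to.
regions = ["us-east1", "us-west1", "europe-west1"]

# Give each region the next unused /16 from the overall range.
region_blocks = dict(zip(regions, vpc_range.subnets(new_prefix=16)))


def any_overlap(blocks):
    """True if any two of the given networks share addresses."""
    return any(a.overlaps(b) for a, b in combinations(blocks, 2))


for region, block in region_blocks.items():
    print(region, block)
print(any_overlap(region_blocks.values()))
```

Reserving a block per Region up front keeps addresses easy to attribute to a failure domain, which helps when writing firewall rules or tracing an incident back to one Region.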
Now let’s examine how these concepts apply in Microsoft Azure, which is a bit different.
Microsoft Azure
Azure introduced Availability Zones in March 2018 to match the same concept in AWS and GCP. Previously (I believe), Azure users had to rely on ‘Availability Sets’ to improve application availability. ‘Availability Sets’ are a mechanism that places resources onto separate failure domains within a single datacenter.
Now, Azure supports the familiar “Regions are composed of multiple AZs” pattern. One thing to be aware of is that not all Azure Regions have multiple AZs. Of the 54 Azure Regions, only 9 currently support Availability Zones, though the US, lots of Europe, and Southeast Asia are covered.
Azure is in the process of enhancing its service offerings to be AZ-aware and take advantage of this availability building block. The Azure Availability Zone overview describes which services are available for use in AZs along with recommendations for using them.
Regions and Zones are foundational concepts to understand and apply when building Cloud applications. While many operators are accustomed to deploying an application into a single datacenter with perhaps a second datacenter available for backup or disaster recovery, the Cloud is different.
Deploying applications into multiple Zones, usually 3, within a Region is a straightforward best practice for creating highly available Cloud applications.
The major Cloud providers offer a Region+Zone model to help you architect, deploy, and operate applications built upon isolated failure domains with well-defined behaviors. Learn the characteristics of your Cloud provider’s model and architect a deployment that will meet your availability goals before deploying your application. This should help reduce surprises when an incident affects a Zone or even a Region.