Today, we’ll introduce the concept of a Failure Domain, which is very important to understand and use when building systems, especially those in the Cloud.
Let’s start with a high-level view of a layered application stack with dependencies running from top to bottom:
A failure domain is a set of resources that provides a service to users and operates independently of other failure domains that provide the same logical service.
In order to be independent, failure domains must not share resources, even (or especially) ‘generic’ ones like network or power. Since network and power are common sources of faults, fault boundaries often align to physical structural elements such as buildings, rooms, racks, and power supplies.
This independence is very powerful and means that if one logical instance of the service is having a problem, the others should not be affected by that same problem.
A failure domain identifies the scope within which the service provider expects certain failures to be contained and certain availability and performance characteristics to hold true. For example:
- within a datacenter, you can expect applications to succeed in connecting to each other with very high probability, and less than 10ms latency
- between datacenters, applications will encounter more failures and higher latency based on the distance between the datacenters (speed of light) and number and types of network paths
That’s pretty abstract, so let’s illustrate it with an example.
Suppose you deploy an application in a single datacenter, on hosts connected by a single network backplane. The single network backplane is a shared resource that establishes a single fault domain for this deployment. Further, since the application is only deployed in this environment, the network backplane represents a single point of failure (SPOF).
When there is a problem with that shared resource, users will experience it, because there is no way for the larger system to contain the failure to a portion of the system. The network backplane is shared by 100% of hosts, so 100% of application instances will be affected.
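To make the "100% of hosts" point concrete, here is a minimal sketch that computes the fraction of hosts affected when a shared resource fails. The host and resource names (`host-1`, `backplane-a`) and the `blast_radius` helper are made up for illustration; they are not from any real inventory or library.

```python
def blast_radius(hosts_by_resource, failed_resource, all_hosts):
    """Return the fraction of all_hosts that depend on failed_resource."""
    affected = hosts_by_resource.get(failed_resource, set()) & all_hosts
    return len(affected) / len(all_hosts)


# Hypothetical single-datacenter deployment: every host hangs off one backplane.
all_hosts = {"host-1", "host-2", "host-3", "host-4"}
hosts_by_resource = {
    "backplane-a": {"host-1", "host-2", "host-3", "host-4"},
}

radius = blast_radius(hosts_by_resource, "backplane-a", all_hosts)
print(f"{radius:.0%} of hosts affected")  # 100% of hosts affected
```

Anything that scores 1.0 here and has no independent alternative is a candidate SPOF.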
Now, suppose your service provider builds out two additional datacenters that are just like the first datacenter and share no dependencies. These datacenters are independent failure domains that provide the building blocks for creating a more highly available system.
You could redeploy your system across these three datacenters, ensuring data is available in at least two DCs:
Now, when that network backplane blows up in the first datacenter (DC1), your application and related orchestration can:
- detect that connections to datasources are unhealthy in DC1
- signal to the application’s network ingress that app instances in DC1 are unhealthy, usually via something like a load balancer health check
- (maybe) scale up application instances in DC2 and DC3 to maintain expected quality of service and account for the loss of app capacity in DC1
- (maybe) elect a new database leader and rebuild replicas, if needed
(note: implementing HA across three datacenters is not ‘simple’; I’m trying to be brief, not minimize the effort)
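The "stop routing to DC1 and scale up the others" steps can be sketched roughly as below. This is only an illustration under simple assumptions: health is a plain boolean per datacenter, every instance has equal capacity, and the `plan_failover` helper and instance counts are hypothetical. A real setup would rely on the load balancer's health checks and the orchestrator's own scaling controls.

```python
import math


def plan_failover(instances_per_dc, healthy):
    """Return a target instance count per healthy DC that preserves total capacity."""
    total = sum(instances_per_dc.values())
    survivors = [dc for dc in instances_per_dc if healthy[dc]]
    if not survivors:
        raise RuntimeError("no healthy failure domain left to route traffic to")
    # Spread the original total evenly across the surviving failure domains.
    target = math.ceil(total / len(survivors))
    # Never plan to shrink a DC below what it already runs.
    return {dc: max(target, instances_per_dc[dc]) for dc in survivors}


# Hypothetical deployment: 4 app instances in each of three datacenters.
current = {"DC1": 4, "DC2": 4, "DC3": 4}
health = {"DC1": False, "DC2": True, "DC3": True}

plan = plan_failover(current, health)
print(plan)  # {'DC2': 6, 'DC3': 6}
```

Note that DC2 and DC3 each need 50% more instances to absorb the loss, so that headroom has to exist before the failure happens.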
At least a few good things come out of this.
First, there is a good chance customers will not notice this failure, because the impact has been reduced to a one-third reduction in capacity.
Second, incident responders can approach this incident with lower urgency and stress than if the entire system (business?) were unavailable.
Third, you may have just discovered that terminating and replacing resources in one failure domain is a nice way to test recovery procedures.
Discovering failure domains
Perhaps you’re thinking that understanding your application’s failure domains seems important, but you’re not sure how to discover them without creating a bunch of chaos.
One approach is to convert your application’s system architecture diagram into a dependency graph:
If any one of the links is completely broken, you likely have an “incident.” The further down in the dependency graph the issue occurs, the larger the incident will probably be.
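If you want to go one step past the diagram, a dependency graph is easy to hold in code. The sketch below uses a made-up graph (the node names are placeholders, not a real system) and makes a simplifying assumption: every edge is a hard dependency with no redundancy. It then lists the dependencies that every application instance relies on, directly or indirectly, which are the candidate SPOFs.

```python
def transitive_deps(graph, node, seen=None):
    """Collect everything node depends on, directly or indirectly."""
    seen = set() if seen is None else seen
    for dep in graph.get(node, ()):
        if dep not in seen:
            seen.add(dep)
            transitive_deps(graph, dep, seen)
    return seen


def find_spofs(graph, entry_points):
    """Return the dependencies shared by every entry point."""
    dep_sets = [transitive_deps(graph, entry) for entry in entry_points]
    return set.intersection(*dep_sets) if dep_sets else set()


# Hypothetical dependency graph: each key depends on the items in its list.
graph = {
    "app-1": ["db-primary", "network-backplane"],
    "app-2": ["db-primary", "network-backplane"],
    "db-primary": ["storage", "network-backplane"],
    "storage": ["power-feed-a"],
    "network-backplane": ["power-feed-a"],
}

spofs = find_spofs(graph, ["app-1", "app-2"])
print(sorted(spofs))
# ['db-primary', 'network-backplane', 'power-feed-a', 'storage']
```

In this toy graph everything below the app instances is shared, so the whole stack is one fault domain. Redundant dependencies ("either of two replicas") would need a richer model than plain edges.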
Try identifying your system’s fault domains by diagramming your system and its dependencies. I’d love to hear what you find, especially if you discover a SPOF that you didn’t realize was there.