What does a good response look like for an incident in a Complex system?
Start by establishing a system for classifying incidents in terms of customer and business impact. Then establish a process to manage the restoration of service that prioritizes the following external goals, in order:
- safety for the service’s customers and their data
- restoration of critical functionality
- restoration of non-critical functionality
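To make the classification idea concrete, here is a minimal sketch of how impact criteria might map to severity levels. The severity names (`SEV1`–`SEV4`) and thresholds are my own illustrative assumptions, not a standard; your criteria should come from your customers' and business's actual needs.

```python
# A minimal sketch of an incident classification scheme. The severity
# levels and thresholds below are hypothetical examples, not a standard.
from dataclasses import dataclass


@dataclass
class Incident:
    customers_affected_pct: float  # share of customers impacted (0-100)
    data_at_risk: bool             # is customer data integrity threatened?
    critical_function_down: bool   # is critical functionality unavailable?


def classify(incident: Incident) -> str:
    """Map customer and business impact to a severity level (SEV1 highest)."""
    if incident.data_at_risk:
        return "SEV1"  # safety of customers and their data comes first
    if incident.critical_function_down or incident.customers_affected_pct >= 50:
        return "SEV2"  # restoring critical functionality is next
    if incident.customers_affected_pct > 0:
        return "SEV3"  # then non-critical functionality
    return "SEV4"      # no measurable customer impact


# Example: a risk to data integrity classifies at the top severity,
# even if functionality could be restored sooner by accepting that risk.
print(classify(Incident(customers_affected_pct=100,
                        data_at_risk=True,
                        critical_function_down=True)))  # → SEV1
```

Note that data safety outranks everything else in this sketch, which is exactly the trade-off GitHub made in the incident described below.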
Let’s illustrate this with an incident at GitHub you might remember. In late October 2018, GitHub experienced a major incident in which the site was partially or fully unavailable for just over 24 hours. The incident was caused by a network partition that affected their main database clusters. These clusters comprise hundreds of MySQL instances in aggregate and hold the data supporting the rich customer experience GitHub provides. So this was a very big deal.
Here is an excerpt from one of the initial public communications:
> Out of an abundance of caution we have taken steps to ensure the integrity of your data, including pausing webhook events and other internal processing systems.
>
> We are aware of how important our services are to your development workflows and are actively working to establish an estimated timeframe for full recovery.
As someone who trusts GitHub with both business and personal work, this message spoke directly to me:
- something big is going on
- engineers and management have assessed the situation and determined the risk to data is high
- they recognize taking down the site is going to impact customers, but Safety trumps SLAs
- they’re going to inform customers as they learn about the incident
While the outage was inconvenient for me, this incident grew my trust in their engineering and leadership. This response told me that a mature engineering organization staffed by professionals was on the job, and I didn’t really need to worry about it.
A week later, GitHub published a detailed analysis of this incident describing the background, the timeline, and the areas they planned to improve. That analysis is well worth reading to understand both what goes on under the covers of a well-managed incident and the huge problems network partitions can cause for leader election protocols.
In the next post, I’ll share some ways you can improve your own incident response processes quickly.
Receive #NoDrama articles in your inbox whenever they are published. Reply to Stephen and the QualiMente team when you want to dig deeper into a topic.