If you’ve noticed some problems with your current incident process and want to improve it, know that you don’t have to (re-)build from scratch.
DevOps has sparked a lot of improvements to incident response processes and a lot of this work has been shared in public.
One particularly good example to learn from is PagerDuty’s Incident Response process. (For context: PagerDuty is a SaaS product that supports incident management, and I have no relationship with the company.) They have published their own incident management process, training, and documentation materials so that you can learn from them or adopt and modify them.
There’s a lot to like about the process and docs, starting with clear definitions for:
- roles and responsibilities in an incident
- the protocols of an incident response, particularly how:
  - Incident Commanders delegate tasks to Subject Matter Experts
  - information flows and decisions are made
  - people change roles as an incident develops and how escalation happens
- a set of Severity levels and the typical response for each
- the Principles of Alerting (spoiler: only alert people for things that require human action)
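The alerting principle in that last bullet is simple enough to sketch in a few lines of Python. This is only an illustration under my own assumptions: the `Alert` class, the `requires_human_action` flag, and the `route` function are hypothetical names, not part of PagerDuty's product or documentation.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    # Hypothetical flag, set by whoever defines the alert:
    # does a person need to do something about this right now?
    requires_human_action: bool


def route(alert: Alert) -> str:
    """Page a person only when human action is required; otherwise just log it."""
    return "page" if alert.requires_human_action else "log"


alerts = [
    Alert("primary database unreachable", requires_human_action=True),
    Alert("nightly backup finished", requires_human_action=False),
]

# Map each alert name to the routing decision made for it.
decisions = {alert.name: route(alert) for alert in alerts}
```

In this sketch, the unreachable database pages someone and the finished backup is only logged. The hard part in practice is deciding honestly which alerts deserve that flag.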
This incident response model is derived from the US National Incident Management System (NIMS), which is described as:
A systematic, proactive approach to guide departments and agencies at all levels of government, nongovernmental organizations, and the private sector to work together seamlessly and manage incidents involving all threats and hazards—regardless of cause, size, location, or complexity—in order to reduce loss of life, property and harm to the environment.
Previously, I suggested your incident response process should prioritize customers’ goals for incidents:
- safety for the service’s customers and their data
- restoration of critical functionality
- restoration of non-critical functionality
How does this response model achieve that?
All of the Severity definitions are tied to customer impact.
For example, a ‘SEV-1’ issue is defined entirely in terms of impact to customers:
Critical issue that warrants public notification and liaison with executive teams.
- The system is in a critical state and is actively impacting a large number of customers.
- Functionality has been severely impaired for a long time, breaking SLA.
- Customer-data-exposing security vulnerability has come to our attention.
Additionally, the typical response for each Severity level is described, along with a reminder of where and how to carry it out.
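To make the customer-impact framing concrete, here is a minimal Python sketch of the SEV-1 criteria quoted above. Treat it as an illustration only: the `Impact` fields, the `is_sev1` function, and the `LARGE_CUSTOMER_COUNT` threshold are names and values I made up, and what counts as "a large number of customers" is something your own team would have to define.

```python
from dataclasses import dataclass

# Hypothetical threshold for "a large number of customers"; pick your own.
LARGE_CUSTOMER_COUNT = 1000


@dataclass
class Impact:
    customers_affected: int      # customers actively impacted right now
    sla_breached: bool           # impairment has lasted long enough to break SLA
    customer_data_exposed: bool  # a customer-data-exposing vulnerability is known


def is_sev1(impact: Impact) -> bool:
    """Return True if any one of the quoted SEV-1 conditions holds."""
    return (
        impact.customers_affected >= LARGE_CUSTOMER_COUNT
        or impact.sla_breached
        or impact.customer_data_exposed
    )
```

Every input to this check describes what customers are experiencing; nothing in it depends on which internal component failed.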
Here’s a quick exercise I’d love your feedback on:
Take 5 minutes (or less) to scan the PagerDuty Incident Response site.
Reply to this email with answers to:
- Do you have a written, up-to-date incident management process?
- What, if anything, would you like to adopt from the PagerDuty Incident Response model?
I promise to summarize the results and publish them.
Have a great, incident-free weekend!
Receive #NoDrama articles in your inbox whenever they are published. Reply to Stephen and the QualiMente team when you want to dig deeper into a topic.