On Friday, J. Paul Reed sparked an interesting thread when he said that an organization including anything resembling a “target resolution time” in its incident management process means the organization doesn’t trust its engineers to do their jobs:
Kaimar Karu, the former Head of ITIL and author of ITIL Practitioner, responded with:
So something like this would be a bad indicator to determine the priority of the incident (as it competes with others) without having to (re-)negotiate the priorities with relevant stakeholders to determine business impact? Genuinely curious jtbc 🙂
and it heated up from there.
My take is that J. Paul Reed is right in practice. The thread triggered flashbacks to all of the negative effects it lists, and more.
Increased Stress
If your engineers already understand an incident’s impact to the business, a resolution target is duplicative, at best. During critical incidents, engineers are (or should be) the ones gathering the data, using that context to determine severity according to a standardized severity scale, and explaining that context and appropriate options to managers.
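For illustration, a standardized severity scale could be encoded so that everyone classifies incidents the same way. This is a minimal sketch; the levels, thresholds, and the classify helper are all hypothetical, not a recommendation:

```python
from enum import Enum

class Severity(Enum):
    """Hypothetical severity scale; a real one is negotiated with stakeholders."""
    SEV1 = "Critical: customer-facing outage or revenue impact"
    SEV2 = "Major: degraded service for many customers"
    SEV3 = "Minor: limited impact, workaround available"
    SEV4 = "Low: internal-only or cosmetic"

def classify(customers_affected: int, revenue_impacted: bool, workaround: bool) -> Severity:
    """Map the data responders gathered onto the shared scale.

    The thresholds below are illustrative assumptions only.
    """
    if revenue_impacted or customers_affected > 10_000:
        return Severity.SEV1
    if customers_affected > 1_000:
        return Severity.SEV2
    if workaround:
        return Severity.SEV3
    return Severity.SEV4

print(classify(customers_affected=25_000, revenue_impacted=False, workaround=False))
# -> Severity.SEV1
```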
Being reminded of an incident resolution target time when you’re trying to fix a problem increases stress on the engineers fixing the problem. The additional stress will degrade performance. Maybe not what the incident commander and managers wanted, is it?
Worse Decision Making
Sometimes managers rush their decision making or apply half-baked fixes that engineers advised against in order to hit a resolution target or incident SLA. Or my favorite example of insanity: “this incident has been running more than an hour, which is the resolution time for a P2 issue, so let’s downgrade it to a P3.” WAT?!
Some of these decisions will ‘work out’ with minimal effects (‘only’ reduced trust, perhaps).
What’s going on here?
The ability to fix a problem in a predictable amount of time is governed by the relative complexity of the system and the situation. If you have an Obvious or Complicated situation (in Cynefin terms), perhaps you have a runbook or an expert who can resolve the incident predictably. If neither of those is true, then the only thing you can be confident of is that your incident resolution times won’t follow a nice, predictable normal distribution.
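To make that concrete, here is a toy simulation; the lognormal shape and its parameters are arbitrary assumptions, chosen only to show how a heavy tail behaves:

```python
import random

random.seed(42)

# Toy model: resolution times for Complex incidents are often heavy-tailed.
# A lognormal is one common assumption; these parameters are arbitrary.
times = sorted(random.lognormvariate(mu=3.5, sigma=1.0) for _ in range(1000))

mean = sum(times) / len(times)
p50 = times[len(times) // 2]
p95 = times[int(len(times) * 0.95)]

print(f"mean: {mean:.0f} min, p50: {p50:.0f} min, p95: {p95:.0f} min")
# With a heavy tail, the mean sits well above the median and the p95
# dwarfs both; no single 'target resolution time' describes this distribution.
```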
So was Kaimar talking about Obvious or Complicated systems, and J. Paul Reed about Complex systems? Maybe? Either way, I don’t see how a target resolution time is relevant for incident responders. Responders are going to use the knowledge and tools they already have on hand to resolve the incident, as opposed to building response capability during the incident.
In my view, ‘target resolution time’ is a pernicious example of a metric that is often misapplied.
Measuring the wrong thing, or in the wrong place
There’s no point in duplicating the business and customer impact of an incident with a target resolution time. If incident responders don’t understand or don’t care about the impact of an incident on their customers or their business, that organization is likely suffering from a lack of management and leadership. Driving understanding of what is important to customers and the business is a primary responsibility of managers and leaders. Further, if leaders can’t describe a scale of incident severity with sufficient precision that everyone can use it, how will they set the resolution targets? My guess is ‘arbitrarily’. And why would responders care? They won’t.
So are incident resolution times worthless?
I do not think incident resolution times are worthless.
However, they are more valuable as a measure of how much understanding and control you have over a (sub-)system than as a target for responders. When the team building and operating a system doesn’t have understanding and control of that system, resolution time is a better measure of management, leadership, organizational culture, and system architecture than of the engineers on the incident call. (J. Paul Reed and Kaimar Karu might agree.)
If you have long or unpredictable resolution times, you can use learning reviews (aka ‘post mortems’) and continuous improvement to improve understanding and reduce complexity of your system.
You may need to:
- add governing constraints to limit variation
- use fault injection to build experience with failure safely, recognize the signatures of certain problems, and handle them safely (see the sketch after this list)
- build robust tooling to handle common failure modes
- improve the monitoring of the system and interactions between sub-systems
- … and more …
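As a sketch of the fault injection item above, a game-day style experiment might look roughly like this; inject_network_latency and check_service_health are hypothetical stand-ins for real chaos and monitoring tooling:

```python
import time
from contextlib import contextmanager

@contextmanager
def inject_network_latency(target: str, delay_ms: int):
    """Hypothetical fault injector; in practice this would drive your
    chaos tooling (tc/netem, a service mesh fault filter, etc.)."""
    print(f"injecting {delay_ms}ms latency into {target}")
    try:
        yield
    finally:
        print(f"removing latency from {target}")

def check_service_health(service: str) -> bool:
    """Placeholder health check; replace with a real probe of your monitoring."""
    return True

def run_experiment():
    """Verify the checkout service tolerates a slow payments dependency,
    so responders can observe this failure signature safely."""
    assert check_service_health("checkout"), "steady state not met; abort"
    with inject_network_latency("payments", delay_ms=500):
        time.sleep(1)  # stand-in for a real soak period under fault
        assert check_service_health("checkout"), "checkout degraded under fault"
    print("experiment passed: failure signature observed and handled")

if __name__ == "__main__":
    run_experiment()
```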
As you stabilize the system, you should see the time to resolve incidents improve. However, incident resolution times are a lagging indicator of team and organizational performance, not constructive guidance during incident response.
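If you do track resolution times as that lagging indicator, a minimal sketch (the incident records below are made up) would summarize the distribution per quarter for learning reviews rather than quote a number during response:

```python
from collections import defaultdict

# Hypothetical incident records: (quarter, minutes to resolve).
incidents = [
    ("2023-Q1", 45), ("2023-Q1", 620), ("2023-Q1", 90),
    ("2023-Q2", 30), ("2023-Q2", 75), ("2023-Q2", 210),
]

by_quarter = defaultdict(list)
for quarter, minutes in incidents:
    by_quarter[quarter].append(minutes)

for quarter, times in sorted(by_quarter.items()):
    times.sort()
    p90 = times[min(len(times) - 1, int(len(times) * 0.9))]
    # Review the trend in learning reviews; don't quote it on incident calls.
    print(f"{quarter}: n={len(times)}, median={times[len(times) // 2]}, p90={p90}")
```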
#NoDrama