Cynefin is a framework created by Dave Snowden to help people understand systems and assist decision-making. Cynefin identifies five domains from which people can analyze a system or situation: Obvious, Complicated, Complex, Chaotic and Disorder.
Managing a system from the context of a domain is important because the framework guides the analysis and decision-making processes. Different tactics are appropriate for sensing information and making changes to a system in each of the five domains. This post will explain the purpose and usage of each of these domains and illustrate usage with a familiar service: file backups.
In the Obvious domain, the situation is stable. Cause and effect are well understood, so people solve problems by application of rules. Those new to an Obvious system can learn the rules of the system and even derive their own solutions quickly. Problem solutions generally converge to a single best practice.
The Obvious domain’s management approach is: sense, categorize, respond. As an example, when administrators of a backup service are warned (sense) the storage system has reached 70% of capacity, they categorize the system as being in a low-storage condition, and respond by removing all backups older than 90 days.
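The sense, categorize, respond rule above is simple enough to automate. Here is a minimal sketch of it, using the 70% threshold and 90-day window from the example. The `Backup` record and the function names are hypothetical, invented for illustration, and do not come from a real backup tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

CAPACITY_WARNING = 0.70              # warning threshold from the example
MAX_BACKUP_AGE = timedelta(days=90)  # retention window from the example


@dataclass
class Backup:
    """Hypothetical record describing one stored backup."""
    name: str
    created: datetime
    size_gb: float


def categorize(used_gb: float, capacity_gb: float) -> str:
    """Categorize: map the sensed storage reading onto a known condition."""
    return "low-storage" if used_gb / capacity_gb >= CAPACITY_WARNING else "ok"


def respond(condition: str, backups: list[Backup], now: datetime) -> list[Backup]:
    """Respond: apply the one best-practice rule and return the backups to keep."""
    if condition != "low-storage":
        return backups
    return [b for b in backups if now - b.created <= MAX_BACKUP_AGE]
```

There is no analysis step here: the condition alone selects the response, which is what makes the situation Obvious.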
But what if the backup service’s workload (situation) changes so that some of the files being backed up are large and volatile? Removing old backup files may no longer free up enough storage to resolve the low-storage condition because the largest files are new. Things just got complicated.
In the Complicated domain, an expert must analyze the situation to relate cause and effect. The expert responds to the problem by applying one of several possible ‘right’ solutions. The Complicated domain’s management approach is: sense, analyze, respond. Returning to the backup example, an expert administrator might analyze the backup archive and discover it now contains large, unimportant files. The administrator could update the backup process to exclude these files entirely, or update the backup purge script to retain these files for only 7 days while keeping backups of important files for 90 days.
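The second option, a purge script with per-class retention, might look something like the sketch below. The 7-day and 90-day windows come from the example. The class labels and the `should_purge` helper are assumptions made for illustration, and deciding which files count as unimportant is exactly the analysis the expert has to supply.

```python
from datetime import datetime, timedelta

# Retention windows from the example; the class labels are hypothetical.
RETENTION = {
    "important": timedelta(days=90),
    "unimportant": timedelta(days=7),
}


def should_purge(file_class: str, created: datetime, now: datetime) -> bool:
    """Return True when a backup has outlived the retention window for its class."""
    return now - created > RETENTION[file_class]
```

Excluding the files from backups entirely would also work, which is the point: the Complicated domain has several good answers, and an expert picks one.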
In both the Obvious and Complicated domains, we expect solutions to converge the system back to the target state reliably.
Now consider what happens when a product manager notices that the internal backup solution works quite well and wants to launch it “as a service” to external customers. The backup service grows many new features to handle the needs of the first 10 customers. What started as a simple client wrapping rsync and a single server attached to a SAN is now a system with more than 10 software components and three engineers managing the security, scaling, availability, and durability of customer data. This example system is well architected: each component manages a single distinct responsibility that complements the other components. The implementation is verified with automated unit and functional tests. The “Minimum Viable Product” (MVP) is finally launched.
Each of the backup system’s components acts independently to fulfill its responsibilities. For example, the component that scales the system up and down to handle the load of customer backup jobs acts independently from the component that ensures data is replicated the minimum number of times. Except the scaling and durability components are not really independent. When the storage backends expand to accept a large amount of new customer data, the replication mechanism will certainly be involved in placing data shards there. Likewise, when the system scales down, the replication and scaling components will need to coordinate to compact data onto fewer storage nodes safely.
In Understanding Complexity, Scott Page, Professor of Complex Systems at the University of Michigan, defines a complex system as:
an environment with interdependent, diverse, connecting, adapting entities.
By necessity, we built the backup system from a collection of interdependent components and people with distinct responsibilities. These entities are connected via assumptions, data flows, constraints, and actions. As each entity reacts to inputs and adapts to optimize its area of responsibility, the system’s overall behavior emerges. And…
Huzzah! The system (mostly) works!
The MVP works and has also moved into the Complex domain. The system’s entities are now interdependent and adapting to each other’s behavior. We can only understand the system’s behavior in retrospect, if at all.
In the Complex domain, outcomes are less certain. The recommended approach for managing Complex systems is: probe, sense, respond. Let’s work through an example where we want to expand the backup service’s capacity to support 100 customers, a 10x increase from the MVP system.
Many bottlenecks and invalidated assumptions are likely to be discovered while growing a Complex system 10x. The way to find them is to probe the system with a small, low-risk change, check if the system’s performance has improved, and decide whether to keep going. If you’re thinking this sounds similar to Agile development and continuous delivery techniques that get changes to customers quickly so we can (in)validate them, you’ve made a great connection.
In a Complex system, it’s unlikely that expanding capacity in a single component of the architecture will suffice. Expanding the backup system’s storage components will certainly be required, but is insufficient on its own. Throughput will likely be a bottleneck in other entities such as the network. The algorithms used to coordinate actions amongst components and even procedures used by Operations teams will need to change in order to manage 10x more storage and customer traffic. You won’t really know where all the constraints are in your system before you try to grow and operate it under realistic conditions. Probe, sense, respond can help you find and address system constraints in a methodical way.
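To make the loop concrete, here is a toy sketch of probe, sense, respond. Everything in it is an assumption made for illustration: the system is reduced to two capacities, throughput is modeled as the smaller of the two, and a probe is a pair of hypothetical apply and rollback functions. A real system has far more entities and a much noisier metric.

```python
from typing import Callable

Probe = tuple[Callable[[], None], Callable[[], None]]  # (apply, rollback)


def probe_sense_respond(measure: Callable[[], float], probes: list[Probe]) -> list[int]:
    """Try each small change; keep it only if the metric improves.

    Returns the indexes of the probes that were kept.
    """
    kept = []
    baseline = measure()
    for index, (apply, rollback) in enumerate(probes):
        apply()                  # probe: make a small, reversible change
        observed = measure()     # sense: check the system's performance
        if observed > baseline:  # respond: keep what helped...
            baseline = observed
            kept.append(index)
        else:                    # ...and undo what did not
            rollback()
    return kept


# Toy model (assumed): throughput is limited by the scarcest resource.
capacity = {"storage": 10, "network": 15}


def throughput() -> float:
    return min(capacity.values())


def grow(component: str, amount: int) -> Probe:
    def apply() -> None:
        capacity[component] += amount

    def rollback() -> None:
        capacity[component] -= amount

    return apply, rollback


kept = probe_sense_respond(
    throughput,
    [grow("storage", 10), grow("storage", 10), grow("network", 10)],
)
print(kept, capacity)
```

In this toy run the second storage expansion is rolled back because the network has become the constraint, so the kept probes are `[0, 2]`. The loop discovers that by trying, not by predicting.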
Complex systems can be generally robust to common failure events such as the loss of a single storage node. Designers will address pain with a clear, frequently occurring cause. However, the system is susceptible to incidents driven by ‘corner cases’ where multiple entities are pushed out of a normal operating range and the system can no longer adapt itself back to a steady state without help. This is why the narrative of a system incident often tells a tale like:
1. Storage node 2 went offline at 23:50 Pacific for unknown reasons
2. The replication component detected the failure in 46 seconds and started redistributing six shards from the failed node across the four remaining nodes
3. Our two largest customers started their nightly full backup jobs at 00:00 Pacific (10 minutes after #1) and their data was assigned to the shards being redistributed
4. Network bandwidth between storage nodes was saturated
   - Node 3’s network transmit was pegged at 100% as it hosted copies of shards being redistributed
   - Nodes 1 and 4 had their receive bandwidth saturated handling the replication, backup processes, and some normal restoration activity started by customers on the east coast
5. Customer backup and restore jobs have slowed to a crawl and most are timing out
6. Incident declared 45 minutes after #1: System is unavailable for customer use
An incident like this pushes the system out of the Complex domain and into the Chaotic domain, where there is no time to work out cause and effect before acting. The recommended approach for managing Chaotic systems is: act, sense, respond. Experienced administrators must use their knowledge, intuition, and what is known about the current state of the system to make a best guess as to how to improve it. Then they must try that action right now, see if it pushes the system back towards the target state, and try something else if it doesn’t. Time is of the essence.
In this event, administrators could halt the least important activity consuming network bandwidth. They might cancel one or both of the full backup jobs that started at midnight, but that depends on the expectations the service has set with customers and on whether administrators have the ability to cancel customer jobs at all! Alternatively, it might be better to restrict the bandwidth available to the replication process if data is not at significant risk. In either case, administrators will need to take action based on their knowledge of the system, see if the system improves, and then do more of that or something else.
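A rough sketch of act, sense, respond for this incident follows. The bandwidth figures, the link capacity, and the ordering of actions are all made up for illustration. In a real incident the ordering is the administrators’ best guess, and there is no guarantee any single action will be enough.

```python
from typing import Callable

LINK_CAPACITY = 100  # hypothetical units of network bandwidth

# Hypothetical bandwidth demand per activity at the time of the incident.
load = {"replication": 60, "full_backups": 70, "restores": 20}


def is_stable() -> bool:
    """Sense: is total demand back within what the network can carry?"""
    return sum(load.values()) <= LINK_CAPACITY


def throttle_replication() -> None:
    load["replication"] = 30


def cancel_one_full_backup() -> None:
    load["full_backups"] = 35


def cancel_remaining_full_backup() -> None:
    load["full_backups"] = 0


def act_sense_respond(actions: list[tuple[str, Callable[[], None]]]) -> list[str]:
    """Take the next best-guess action, check the system, stop once it is stable."""
    taken = []
    for name, act in actions:
        act()              # act: do something now
        taken.append(name)
        if is_stable():    # sense: did it help enough?
            break          # respond: stop here, otherwise try the next action
    return taken


taken = act_sense_respond([
    ("throttle replication", throttle_replication),
    ("cancel one full backup", cancel_one_full_backup),
    ("cancel remaining full backup", cancel_remaining_full_backup),
])
print(taken, load)
```

With these numbers, throttling replication alone is not enough, so one full backup is also cancelled and the last action is never needed. Unlike a probe in the Complex domain, nothing here is chosen for being small or safe to fail: the first goal is simply to get the system back to a stable state.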
Once the system is restored to normal service, a Learning Review (aka Post-mortem) should be conducted to understand more about the incident. The review team can recommend improvements to constraints or practices that may keep the system operating safely in the Complex domain.
Failure testing, aka Chaos Engineering, is a practice used to explore a system’s operational state space in a quick and directed fashion. In failure testing, engineers inject specific failures into the system to see how it reacts and to test recovery procedures. These controlled tests can be run in test or (eventually) production environments so that you can learn about and improve the robustness of your system to unknown-unknown events continuously, instead of only in response to an unmanaged excursion into the Chaotic domain.
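As a very small illustration of the idea, the sketch below injects a node failure into a simulated shard placement and checks that a naive recovery procedure restores the replica count. The placement map, the `MIN_REPLICAS` value, and the repair logic are all assumptions. A real failure test would target the actual system, not a model of it.

```python
MIN_REPLICAS = 2  # assumed durability requirement

Placement = dict[str, set[int]]  # shard name -> nodes holding a copy


def fail_node(placement: Placement, node: int) -> Placement:
    """Inject the failure: every copy on the failed node is lost."""
    return {shard: nodes - {node} for shard, nodes in placement.items()}


def under_replicated(placement: Placement) -> list[str]:
    """Observe: which shards have fewer copies than required?"""
    return sorted(s for s, nodes in placement.items() if len(nodes) < MIN_REPLICAS)


def re_replicate(placement: Placement, live_nodes: set[int]) -> Placement:
    """Naive recovery: copy each short shard to the lowest-numbered live nodes."""
    repaired = {}
    for shard, nodes in placement.items():
        nodes = set(nodes)
        for candidate in sorted(live_nodes):
            if len(nodes) >= MIN_REPLICAS:
                break
            nodes.add(candidate)
        repaired[shard] = nodes
    return repaired


placement = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}}
degraded = fail_node(placement, 2)
recovered = re_replicate(degraded, live_nodes={1, 3, 4})
print(under_replicated(degraded), under_replicated(recovered))
```

Even this toy hints at the incident described earlier: the naive repair piles new copies onto the lowest-numbered nodes, which is the kind of hot spot a failure test is meant to surface before customers do.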
The Disorder domain represents situations when it is unclear which of the other domains applies. Leaders can navigate a situation out of the Disorder domain by breaking the system or problem into its components and assigning each of those to the proper domain.
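One way to picture that decomposition is as a lookup from each part of the problem to a domain and its management approach. The approaches for the four ordered domains are the ones described in this post. The `plan` helper and the particular assignments are a hypothetical illustration built from the backup example.

```python
from enum import Enum


class Domain(Enum):
    """Each domain's value is the management approach described in this post."""
    OBVIOUS = "sense, categorize, respond"
    COMPLICATED = "sense, analyze, respond"
    COMPLEX = "probe, sense, respond"
    CHAOTIC = "act, sense, respond"


def plan(assignments: dict[str, Domain]) -> dict[str, str]:
    """Map each sub-problem to the approach of the domain it was assigned to."""
    return {part: domain.value for part, domain in assignments.items()}


# Hypothetical breakdown of the backup service's problems.
approaches = plan({
    "purge backups older than 90 days": Domain.OBVIOUS,
    "tune retention for large, volatile files": Domain.COMPLICATED,
    "grow the service to 100 customers": Domain.COMPLEX,
    "saturated storage network during an incident": Domain.CHAOTIC,
})
for part, approach in approaches.items():
    print(f"{part}: {approach}")
```

The assignment itself is a judgment call. The value of writing it down is that each piece then gets tactics suited to its domain.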
Cynefin helps people understand and reason about systems and situations in a structured way. With this framework, you can determine how much faith to put in a mental model of a system, how to generate ideas to improve or repair the system, and the relative probability of success of an action. Hopefully you can see how successful systems naturally move towards the Complex domain as they grow and how constraints and continuous learning can keep a system out of Chaos.
- A Leader’s Framework for Decision Making, Snowden & Boone
- Cynefin for Everyone, Keogh
- Cynefin Framework, Wikipedia