REdeploy starts tomorrow [1]. I am really looking forward not just to the great talks—which, thanks to the miracles of modern technology, I could watch on YouTube—but to meeting members of the resilience engineering community in the flesh. There are a huge number of folks I have been in discussion with online over the past few months, and it will be great to discuss (argue?) with them live.
One thing that has remained constant throughout my investigation into resilience engineering is that no matter how many answers I find, there are always more questions. The night before REdeploy is no exception. I look forward to contrasting the questions I have going in with the insights I take away from the conference.
That complex systems evolve multiple layers of defense is one of the first observations of How Complex Systems Fail. Since Cook wrote this paper in 1998, there has been an incredible amount of innovation in the software industry. Technologies like cloud computing and smartphones, which did not exist in ’98, are now central to how much of the industry builds software. Also, while Cook argues that “the high consequences of failure” is what leads complex systems to evolve defenses, much of the tech industry is—or considers itself to be—low consequence compared to domains like the airline industry, medicine, or power generation.
So, then, might software organizations (especially hyper-growth companies fueled by venture capital dollars) be able to leverage advanced components (AWS, iPhones, …) to build out complex systems that are not well defended against failure? These systems could have an abundance of “bright debt” to go along with their dark debt. For example, when asked recently at Papers We Love SF how common Chaos Engineering is in industry, Jessie Frazelle replied that it was rare—because in her experience companies had an abundance of urgent fires to fight and no bandwidth to learn from additional chaos.
More generally and abstractly, I am interested in the boundary between robustness and resilience. How do these subtypes of reliability interact? What heuristics can be used to allocate scarce resources between the two areas? Are there practices that improve both robustness and resilience, posing less of a tradeoff and more of a win-win?
In their Debriefing Facilitation Guide, Etsy writes that “The Goal Is to Learn” and that “Blameless postmortems drive a significant percentage of our development”. The resilience engineering community believes in learning from deep study of incidents; as Nora Jones explains:
Incident Analysis is completely different than the standard postmortem process that you see written about in the Google SRE book and other incident marketing materials. It is a whole field of study and practice on extracting valuable data from incidents focusing on how.
My brow furrowing begins when I imagine trying to explain the “ROI” or “business value” of incident analysis to an executive. The Etsy guide does provide some guidance on postmortem success:
We have come to think about two very basic success metrics when it comes to facilitating debriefings: the acquisition of new information that affects future work; and attendees’ willingness to participate in future debriefings.
Sadly, in the context of OKR-driven organizations making resourcing decisions, “acquisition of new information that affects future work” feels ethereal compared to the tidy and crunchy SLOs of Site Reliability Engineering. John Allspaw’s conclusion that resilience is “proactive activities aimed at preparing to be unprepared, and here’s the key part, without an ability to justify it economically” does not give me confidence that this knot will be untangled anytime soon.
The consensus of the resilience engineering community [2] seems to be that the solution to this conundrum is “flying under the radar”. Roughly speaking, this involves doing the work without asking permission until finding sufficient success, and then arguing for resources on the basis of that experience. I hypothesize that better examples of and insight into operationalizing what we learn from incidents are key to increased adoption of resilience engineering in software organizations, and I hope to explore this idea at REdeploy.
[1] Written the night before REdeploy but published a bit later.
[2] As opined by Ryan Kitchens in his recent InfoQ podcast, Lorin Hochstein on Twitter, and John Allspaw and Richard Cook in private communication.