Embracing SRE as an incomplete approach to reliability

Model error is a fact of life. We operate in a world of stunning complexity, with tens of millions of people producing software and billions consuming it. We build on top of rapidly evolving hardware; today a top-of-the-line cell phone has more resources than a world-class supercomputer did 20 years ago. To iterate quickly in the face of this complexity, we have no choice but to base our decisions on imperfect summaries, abstractions and models.

When we construct models of reliability for our sociotechnical systems, we tend to play shell games with our complexity and ambiguity, hiding them wherever seems least offensive. The Shingo Prize-winning book Accelerate “asked respondents how long it generally takes to restore service when a service incident occurs” and found that lower mean time to restore (MTTR) predicted better organizational performance. But when does a disruption become an “incident”? And when is service “restored”? [1] Complex systems in the real world inevitably run degraded and with latent failures present.
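To make the ambiguity concrete, here is a minimal sketch of how the measured time to restore for a single disruption depends on where we draw the start and end lines. The timeline, the event names, and the helper `time_to_restore` are all invented for illustration; they do not come from Accelerate or from any real incident.

```python
from datetime import datetime, timedelta

# Hypothetical timeline for one disruption (every timestamp is made up).
events = {
    "first_errors": datetime(2019, 1, 1, 10, 0),    # users start seeing failures
    "alert_fired": datetime(2019, 1, 1, 10, 12),    # someone declares an "incident"
    "mitigated": datetime(2019, 1, 1, 10, 45),      # traffic failed over, still degraded
    "fully_repaired": datetime(2019, 1, 1, 14, 30), # latent fault actually removed
}

def time_to_restore(start_event: str, end_event: str) -> timedelta:
    """Elapsed time between two named points on the timeline."""
    return events[end_event] - events[start_event]

# Narrow definition: the clock runs from the alert to the mitigation.
narrow = time_to_restore("alert_fired", "mitigated")
# Broad definition: the clock runs from first user impact to full repair.
broad = time_to_restore("first_errors", "fully_repaired")

print(f"narrow: {narrow}, broad: {broad}")
```

With these made-up timestamps, the same disruption yields 33 minutes under the narrow definition and 4.5 hours under the broad one, so any mean computed over many incidents inherits whichever definition the respondent had in mind.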

Site reliability engineering (as defined by Benjamin Treynor Sloss, who evolved the discipline at Google after joining in 2003) has a core tenet of “Pursuing Maximum Change Velocity Without Violating a Service’s SLO” in order to resolve a “structural conflict between pace of innovation and product stability.” There are many interesting threads to pull here. Two that I find fascinating but that are, sadly, outside the scope of this post are: (1) other ways in which the conflict between innovation and product stability might be resolved (perhaps by adding stability to product manager performance reviews?); and (2) the wide spectrum of ways in which SRE is implemented in the wild [2].

What I believe is more impactful is that (Sloss’ definition of) SRE dodges substantial complexity by (implicitly?) arguing that a well-selected, well-monitored collection of SLOs that are all green is sufficient for a system to be reliable [3]. This is a misconception that puts us at risk of surrogation: mistaking our map (SLOs) for our territory (reliability). The fundamental insufficiency is that SLOs cannot protect us from dark debt: “unappreciated, subtle interactions between tenuously connected, distant parts of the system”.

We see such dark debt in the “atmospheric” conditions behind the perfect storm that formed in June 2019, when, during incident GCNET-19009, multiple Google Cloud regions were disconnected from the Internet for over three and a half hours, more than 50 times the downtime permitted by the service’s monthly SLA of 99.99% availability [4]:

Two normally-benign misconfigurations, and a specific software bug, combined to initiate the outage: firstly, network control plane jobs and their supporting infrastructure in the impacted regions were configured to be stopped in the face of a maintenance event. Secondly, the multiple instances of cluster management software running the network control plane were marked as eligible for inclusion in a particular, relatively rare maintenance event type. Thirdly, the software initiating maintenance events had a specific bug, allowing it to deschedule multiple independent software clusters at once, crucially even if those clusters were in different physical locations.
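As an aside, the comparison between the outage duration and the 99.99% SLA is simple budget arithmetic. The following is a rough sketch, assuming a 30-day month and treating the whole outage as total unavailability; both are simplifying assumptions rather than details from the incident report, and the function names are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60

def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Downtime in minutes that an availability target permits over the window."""
    return (1.0 - availability_target) * window_days * MINUTES_PER_DAY

def budget_multiple(outage_minutes: float, availability_target: float,
                    window_days: int = 30) -> float:
    """How many times over the permitted downtime a given outage runs."""
    return outage_minutes / allowed_downtime_minutes(availability_target, window_days)

# 99.99% over an assumed 30-day month permits a little over four minutes.
budget = allowed_downtime_minutes(0.9999)
# An outage of exactly 3.5 hours, measured against that budget.
multiple = budget_multiple(3.5 * 60, 0.9999)

print(f"budget: {budget:.2f} min, outage multiple: {multiple:.1f}x")
```

Under these assumptions the monthly budget is roughly 4.3 minutes. An outage of exactly three and a half hours is a little under 49 times that budget, and the multiple passes 50 once the outage exceeds about 216 minutes, which is consistent with an outage lasting “over three and a half hours”.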

In my view, then, SLOs are a tool we can use to align our organizations to avoid some especially painful ways of making our users unhappy, like being down for too long or having excessive tail latency on important endpoints. And indeed, many organizations have adopted SLOs and found that managing the reliability of their systems has become easier (example). Unfortunately, healthy SLOs, no matter their quality, are not enough to certify that the functionality we provide to our users will be safe or reliable in the future. As Cook says, catastrophe is always just around the corner.

Unsurprisingly, there’s no silver bullet here (although you can always consider improving your postmortems). However, a clearer view of the nature of reliability supports more informed decisions about how to balance tradeoffs, resources, and prioritization as we seek to innovate quickly and reliably. I hope to learn more, later this week at REdeploy, about how approaches to robustness (defense against known failure modes), such as SLOs, and approaches to resilience (defense against unknown failure modes) can compose to improve overall reliability!

Thanks to Rein Henrichs for feedback on this post.

[1] John Allspaw has a pair of interesting blog posts on these topics: Moving Past Shallow Incident Data and Incidents As We Imagine Them Versus How They Actually Happen

[2] Seeking SRE is a great source on the SRE multitudes. Google also has a blog post on the topic.

[3] To be precise, these must be achieved with low toil, but toil is orthogonal to our discussion.

[4] This SLA, based on which Google issues refunds to customers, is almost certainly stricter than the internal SLO used by SREs. 
