Two philosophies on achieving reliability

I see two major philosophies that underpin [1] approaches to achieving reliability [2] in the software industry. In brief, they are

  • Reliability is achieved through engineering quantitative systems and processes that provide correct choices for tradeoff decisions

  • Reliability emerges from nurturing an ecology that learns to adapt to changing conditions

Based on loose parallels to Hollnagel's safety science work, we might name these reliability-i and reliability-ii. Unfortunately, this could implicitly suggest that reliability-ii is better than, "after", or completely distinct from reliability-i. Instead, I'll designate them “angled reliability” and “curved reliability” [3]. My main observations from these definitions are

  • Angled reliability is (relatively more) aligned with SRE, SLIs, etc. [4]

  • Curved reliability is (relatively more) aligned with modern approaches in safety science like resilience engineering and recent work on incident analysis in software

  • Industry has generally focused on angled reliability

  • Explicit, visible industry efforts in curved reliability have been few and far between [5]

Based on my explorations over the past few months, I believe (vs observe) that

  • The increasing velocity and complexity of the software industry will lead to curved reliability increasing in importance and value

  • A major challenge to organizational investment in curved reliability is the ability to make its value legible to leadership while simultaneously avoiding oversimplification

  • "Success" involves the composition of angled and curved reliability

If this resonates or if you have thoughts, drop me a tweet? This writeup sacrifices completeness for brevity—I'm happy to get into the details if you have questions! You may also be interested in the blog and/or newsletter of Learning from Incidents in Software.

[1] via fundamental beliefs and assumptions, e.g. about system linearity, the existence of human error, etc.

[2] however you'd like to define it! A colloquial, common sense, and/or slightly ambiguous definition should be fine for our purposes.

[3] To me, “angled” reliability evokes the “crunchy” nature of metrics and the like. Plus, slightly clunky names may help avoid buzzword-coining.

[4] here we'll use Google's definitions from the Site Reliability Engineering book

[5] For an example see Lorin Hochstein's article on OOPS at Netflix
