I see two major philosophies that underpin [1] approaches to achieving reliability [2] in the software industry. In brief, they are:
Reliability is achieved through engineering quantitative systems and processes that provide correct choices for tradeoff decisions
Reliability emerges from nurturing an ecology that learns to adapt to changing conditions
Based on loose parallels to Hollnagel's safety science work, we might name these reliability-i and reliability-ii. Unfortunately, this could implicitly suggest that reliability-ii is better than, "after", or completely distinct from reliability-i. Instead, I'll designate them “angled reliability” and “curved reliability” [3]. My main observations from these definitions are:
Angled reliability is (relatively more) aligned with SRE, SLIs, etc. [4]
Curved reliability is (relatively more) aligned with modern approaches in safety science like resilience engineering and recent work on incident analysis in software
Industry has generally focused on angled reliability
Explicit, visible industry efforts in curved reliability have been few and far between [5]
Based on my explorations over the past few months, I believe (vs observe) that:
The increasing velocity and complexity of the software industry will lead to curved reliability increasing in importance and value
A major challenge to organizational investment in curved reliability is the ability to make its value legible to leadership while simultaneously avoiding oversimplification
"Success" involves the composition of angled and curved reliability
If this resonates or if you have thoughts, drop me a tweet? This writeup sacrifices completeness for brevity—I'm happy to get into the details if you have questions! You may also be interested in the blog and/or newsletter of Learning from Incidents in Software.
[1] via fundamental beliefs and assumptions, e.g. about system linearity, the existence of human error, etc.
[2] however you'd like to define it! A colloquial, common sense, and/or slightly ambiguous definition should be fine for our purposes.
[3] To me, “angled” reliability evokes the “crunchy” nature of metrics and the like. Plus, slightly clunky names may help avoid buzzword-coining.
[4] Here we'll use Google's definitions from the Site Reliability Engineering book.
[5] For an example, see Lorin Hochstein's article on OOPS at Netflix.