At 3 AM Pacific time on Wednesday, November 6th, Liz Fong-Jones tweeted a report that Honeycomb was experiencing network issues on AWS and asked whether others were seeing them too. The tweet was part of the response to the incident that became Running Dry on Memory Without Noticing, which was remediated later Wednesday morning. In line with Honeycomb’s transparent culture, Liz essentially live-tweeted her part of the incident response, including a visualization showing that the precipitating event was a memory leak (Honeycomb dogfoods, so this screenshot is from their product):
On November 8th, Honeycomb held an incident review, and on November 21st, Liz published the public report linked above. At Liz’ invitation, I was a guest at the review, and below is what I learned. My writeup focuses on what I found most surprising and where I perceived the largest gaps between work as imagined and work as done, and it assumes that you have read Liz’ report.
As noted in the public report, the precipitating event was diagnosed as a memory leak by a well-rested engineer who came online with fresh eyes Wednesday morning. In the incident review, it was quickly determined that the leak was caused by an upgrade to a zstd Go library in a commit deployed on the afternoon of Tuesday, November 5. In the intervening ~12 hours, the issue remained latent because deploys caused processes to (safely) restart, releasing their memory. It was not until early Wednesday that the gap between deploys grew long enough for the leak to begin triggering crashes.
In the incident review, it turned out that the memory leak in the new zstd was a surprise—the engineer who made the change was not aware of this change in behavior between versions. I know Honeycomb cares deeply about operational excellence; I’m confident that this commit was made with diligence and that the engineer who made it observed the system after release and believed that production was healthy going into Tuesday evening. So the question becomes: why did that assessment of health make sense at the time? And what might we learn for the future?
Sadly, the engineer involved in the zstd upgrade was not at the incident review, but a fascinating fact emerged! Honeycomb’s CI/CD process includes a Valgrind-like check for various resource leaks (which, it turns out, does not catch leaks in dependencies). I can see this sort of tooling contributing to a powerful—and unfortunately incorrect—sense of security: that this class of latent bug would be caught by automated tooling. In my experience, this kind of tooling also tends to focus on first-party code changes rather than dependencies, config changes, and the like.
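To make that gap concrete, here is a minimal sketch of what an in-CI heap-growth check might look like in Go. This is not Honeycomb’s actual tooling (I don’t know its details), and the names, workload, and threshold are invented for illustration. The point is that a check like this only observes the code paths and workloads the test exercises, so a leak introduced by a dependency upgrade on a production-only path, or only under production load, can slip straight past it.

```go
// leakcheck_test.go: an illustrative heap-growth check, not Honeycomb's real one.
package ingest

import (
	"bytes"
	"compress/gzip"
	"runtime"
	"testing"
)

// compressPayload stands in for a first-party code path that CI exercises.
func compressPayload(p []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(p)
	zw.Close()
	return buf.Bytes()
}

// heapInUse forces a GC before measuring to reduce allocator noise.
func heapInUse() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapInuse
}

func TestCompressPayloadDoesNotLeak(t *testing.T) {
	payload := bytes.Repeat([]byte("event"), 1024)
	before := heapInUse()
	for i := 0; i < 1000; i++ {
		compressPayload(payload)
	}
	after := heapInUse()

	// Allow generous slack for allocator noise; only flag sustained growth.
	const allowedGrowth = 10 << 20 // 10 MiB
	if after > before+allowedGrowth {
		t.Fatalf("heap grew from %d to %d bytes; possible leak", before, after)
	}
}
```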
The public report clearly identifies confirmation bias (anchoring to AWS networking as a precipitating event) as a major contributing factor. Another major contributing factor was the loss of internal telemetry data due to process crash—that is, while failures at the ALB could be observed, failures in Honeycomb Go processes were obscured. Investigating improvements that would avoid loss of telemetry data on process crash is called out as an action item.
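I can only speculate about what that action item will turn into, but one common partial mitigation is to flush buffered telemetry on ordinary shutdown signals and on panics. Below is a minimal Go sketch under that assumption; flushTelemetry and runService are hypothetical stand-ins, and note that an OOM kill arrives as SIGKILL, which a process cannot trap, so this narrows the gap rather than closing it.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// flushTelemetry stands in for shipping whatever spans/metrics are still buffered.
func flushTelemetry() {
	fmt.Println("flushing buffered telemetry before exit")
}

// runService stands in for the real service loop.
func runService() {
	select {}
}

func main() {
	// Flush on SIGTERM/SIGINT so graceful shutdowns do not drop buffered data.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-sigs
		flushTelemetry()
		os.Exit(0)
	}()

	// Flush on panic as well, then re-raise so the crash remains visible.
	defer func() {
		if r := recover(); r != nil {
			flushTelemetry()
			panic(r)
		}
	}()

	runService()
}
```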
I think it is worth reiterating the challenge that lack of positive evidence of failure presents to operators. Reasoning about whether problems should be attributed to service or network behavior (or both!) is hard enough without having to also consider the subtleties of absence of evidence versus evidence of absence. This case strikes me as another example of how failures in observability tools are particularly painful and missing data particularly pernicious—which, of course, Honeycomb folks know better than I do. This is also an example of the benefits of transparency in incident response—experts at Netflix also weighed in on Twitter! I’d love to hear from Honeycomb how they handle this class of problems as well as what they recommend their customers do.
I think the public report does a great job covering how learning from the incident will inform improvements to Honeycomb’s incident process, as well as embracing the cultural and process aspects of reliability in complex socio-technical systems (although obviously, the devil is in the details). One related topic that came up in the review is that actively trying to falsify hypotheses can help lower the chance of being “led down the garden path” by confirmation bias. This seems especially helpful in high-impact cases, where the work can be parallelized: some engineers try to disconfirm the leading hypothesis while others concurrently plan remediation assuming it is true.
Work in the safety discipline of resilience engineering suggests that we consider what went right in addition to what went wrong. Most of the time things go well, and that success is powered in large part by our expertise. We can improve overall reliability by making things more good as well as less bad.
The public report acknowledges the central positive role the customer-facing SLO played in detecting the event. More subtly, one thing that came up in the incident review was that on Wednesday, information on an alternate hypothesis (the memory leak) from a newly participating engineer was welcomed. That is, an engineer presented contradictory evidence to a group of engineers confident in an alternate hypothesis they had spent many hours working under—and they were open to it. In my opinion this is anecdotal evidence of Westrum generativity.
Finally, I would say that the incident review environment itself was collegial and focused on understanding, fixing, and learning from the incident. As a low-context guest, I am ill-equipped to notice anything interpersonal or organizational that might be happening “under the surface”, but I witnessed healthy and lively debate and saw no red flags.
In my opinion the best resource today for improving incident review is Etsy’s Debriefing Facilitation Guide. The Learning from Incidents in Software community (LFI for short, founded by Nora Jones) has also recently launched a public website and blog with great content. Another great source is the blog of Adaptive Capacity Labs. ACL is an incident analysis focused consultancy founded by John Allspaw, Dr. Richard Cook, and Dr. David Woods, giants in the field of resilience engineering and its intersection with software.
Given the prevalence of video conferencing in companies today, I’d also recommend recording incident reviews. Incident report documents are a somewhat lossy synthesis of the discussion that happens during the review—and as Cook notes, “even high quality video recording cannot recover the dynamic in the room.” Archiving a recording allows you to more easily mine insight from earlier incidents as your incident review/analysis skills level up.
I’m happy and interested to be a guest at your incident review if that’s an option. Just drop me a note on Twitter for scheduling!
As far as I know it is unusual for incident reviews to include external guests, let alone allow them to write up their experiences, and I am grateful to Liz and Honeycomb for the opportunity to participate. I am also grateful to Lime and Affirm for earlier opportunities to sit in on incident reviews, which informed my approach to my visit to Honeycomb.
I have been lucky to participate in the LFI community, and it has played an instrumental role in my exploration and learning at the intersection of safety science and software engineering. Thanks to the LFI community and especially to the admins who invest their time and hard work into cultivating it.
I participated in a one hour incident review with no prior knowledge of Honeycomb beyond what is publicly available. I don’t know what power dynamics or goal conflicts exist there, or even the internal name for their ingest service (which is Shepherd). I didn’t interview anyone or facilitate the incident review. This post should not be considered (formal?) incident analysis; rather it is an example of what can be learned from attending others’ incident reviews, even under those constraints.
I take the trust placed by organizations who invite me into reviews seriously, and I had this post reviewed by Liz before publishing to make sure Honeycomb found it appropriate.
Thanks to Liz Fong-Jones, Ben Hartshorne, Ryan Kitchens, and Paul Osman for feedback on drafts of this post.
Had they been aware, it might not have been merged—or it might have been, with a mitigation like periodic service restarts put in place. Talking hypothetically about what could have happened is counterfactual, and usually doesn’t help us understand the actions folks actually took.