You must be this tall to do the five whys

October 16, 2021

There was recently an essay by Gergely about incident response, with a related example of “The 5 Whys” questions, which are often mentioned as a good way to do incident retrospectives.

I am not a fan of the “5 whys” approach, because I have come to believe that it overfocuses on one particular path. With just 5 “whys” there is only so much you can express, and it will only touch “abstract group output” or “technology”, without getting into the touchy-feely, people side of the issue. “The 5 whys” is nowhere near deep enough (or wide enough) to even start approaching that side.

Let’s set the scene: there was an outage due to a bug, and teams are doing a retrospective which focuses on how mitigation was handled (not the bug that caused the incident). We “chop scope” here to maintain focus. Imagine we start with the following “5 whys”:

Why was the site down for customers? Because there was a small bug. It was easy to fix but deploying the fix-forward took too long.
Why did it take so long to release the fix? Because half of the time was spent waiting for CI to pass, and the other half for the fleet to be rotated
Why did it take so long for CI to pass? Because our UI tests are slow and flaky
Why did it take so long to rotate the fleet? Because we do not have sufficient instances to double capacity during release, and the deployed artifact is a very large image
Why do we need immutable infrastructure and large images? Because the industry does not provide a readymade path for “slim overlay” artifacts and because immutable infrastructure has other benefits we found more important than rapid releases

Now imagine a slightly different list of “whys”, asked with a bit more tenacity:

Why was the site down for customers? Because there was a small bug. It was easy to fix but deploying the fix-forward took too long.
Why did it take so long to release the fix-forward instead of rolling back? Because to roll back you need to have adequate stored artifacts and a UI that operators can use in time of duress
Why don’t we have such a UI? Because the deployment control is run by a different team than the operators managing incidents
Why can’t operators create their own tools for deployment control? Because they are going to use tools not familiar to the team managing infrastructure, and that team wants to own and control all of those tools
Why does the team managing infrastructure decide on choice of tools for building deployment control? Because they have their own deployment control scheduled for Q4 of next year.

See - there is already a plot emerging. We have two teams, the incentives of those teams are misaligned, therefore there are no rollbacks and every fix must be a fix-forward, which must trigger the long UI testing path. Oh, and apparently operators of the service do not like the deployment tooling. But we are doing an exercise here, let’s push it a notch and get edgier still. Remember, this page is a “lab” - nobody is going to get hurt:

Why was the site down for customers? Because there was a small bug. It was easy to fix but deploying the fix-forward took too long.
Why did it take so long to release the fix-forward instead of rolling back? Because there was no interface available to perform rollbacks
Why was there no interface to perform rollbacks? Because 6 people tried to set up a deployment control UI which would allow for rollbacks, but all of them were denied the opportunity “to not stir waters”
Why, if the need to have deployment control was so apparent, did we discourage 6 people from solving that problem? Because we didn’t want to make the team maintaining infrastructure upset
Why were we so scared of making them upset? Because a person on that team is known to be volatile, and could have deleted the production database and could create great risks for the business
Why do we have a team which poses such a business risk? Because upsetting them could trigger actions that would severely damage production systems, and we lack proper safeguards to prevent this.

See, this gets hot right there. More importantly, it casts the applicability of “The 5 Whys” in a very interesting light: “who is interested in uncovering the truth”? The “truths” to uncover differ depending on who is digging.

People usually prefer to feel good, and usually do not like being blamed or singled out for things. And the desire behind incident retrospectives is a good one - it is to figure out the reasons for the most painful things in the incident, and to figure out how, in the future, something like this can be prevented. The challenge lies in the fact that there are multiple truths. All three variants of the conversation above present things which are true, but groups will be motivated to steer towards the “first” variant, where there is no blame, no statement of potential unintended consequences, weakness or fear, and questions which are on many people’s minds get avoided. This first variant is not intrinsically “better” than the rest, and in terms of answering the question about recovery and improving the situation it might provide hints as to how the organisation could proceed.

But from the perspective of “uncovering the truth” - which, I would concur, is a very noble objective which should not be discarded - you will be getting much more bang for your buck with the “third variant”, where you steer directly into the touchy-feely problems. The issues like the one here very often are issues of dysfunctions within (or between) teams, and they will often be the hardest to address. Fixing them will have an incredible rejuvenating effect as well.

The challenge is twofold: first, 5 is too few. When you are in a situation of extreme stress you do need to be able to get by with “just 5”, I imagine, but in a more relaxed case (and we are mostly in this situation - we are just running websites, come on!) all three variants above could be explored and discussed. All three variants provide facets of truth, and if a collective is truly mature they will be able to explore more of these facets without becoming defensive or hostile. There are tools for this! Non-violent communication is just one that comes to mind.

It is not a linked list, it is a tree - and doing a good traversal of that tree can, indeed, “uncover” truth that you do want to know - or, often, that is long overdue being put on paper or spoken out in public. This is where emotional safety, trust and vulnerability are paramount. Of course you could say this is a conversation path only those in positions of relative safety and privilege can permit themselves to have. And yet: ponder the idea that you could get to a mode of communication where you could unwind things to the level of “stop short of societal problems at large” instead of “stop short of mentioning anything related to how people behave badly with one another”.
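To make the "tree, not a linked list" point concrete, here is a minimal sketch in Python. The `Why` class, the `paths` helper and the exact shape of the tree are my own illustration, not part of any retrospective tool, and the questions and answers are abbreviated from the variants above. The point is only that one starting question fans out into several chains, and a single "5 whys" pass walks just one of them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Why:
    """One question/answer pair; `children` are the follow-up whys."""
    question: str
    answer: str
    children: list[Why] = field(default_factory=list)


def paths(node: Why) -> list[list[str]]:
    """Return every root-to-leaf chain of questions, depth-first."""
    if not node.children:
        return [[node.question]]
    return [[node.question] + rest
            for child in node.children
            for rest in paths(child)]


# Hypothetical tree, abbreviated from the variants in this post.
root = Why(
    "Why was the site down for customers?",
    "A small bug; deploying the fix-forward took too long.",
    [
        Why("Why did it take so long to release the fix?",
            "Waiting for CI and for the fleet to be rotated.",
            [
                Why("Why did it take so long for CI to pass?",
                    "UI tests are slow and flaky."),
                Why("Why did it take so long to rotate the fleet?",
                    "Not enough instances; very large image."),
            ]),
        Why("Why fix-forward instead of rolling back?",
            "No stored artifacts or UI for rollbacks.",
            [
                Why("Why don't we have such a UI?",
                    "Deployment control is run by a different team."),
            ]),
    ],
)

# Each printed line is one "linked list" a 5-whys session could have walked.
for chain in paths(root):
    print(" -> ".join(chain))
```

A session that stops after the first chain never sees the other two, which is exactly the overfocus described above.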

Compare the potential outcomes of this particular session:

If we use the first variant, likely we will decide to “add button to bypass UI tests”. This is a plausible approach, only it introduces its own decision trees later:
- How do we know our UI doesn’t have regressions?
- Who decides which of the builds with skipped UI tests is “good enough”?
- There is still no rollback…
- A likely consequence of the tradeoff will be “We have deployed a version with broken UI to mitigate an unrelated bug”
If we use the last variant:
- We uncover a great risk to the business and we can look for reconciliation or rupture to mitigate that risk
- We can start work on deployment tooling
- We ruin silos teams have worked themselves into
- We likely don’t have to skimp on UI testing because we will be able to roll back to a “known good” version

Only you must be this tall. The very hot, intense “5 Whys” bursts could work - between equal co-founder peers, and in situations of extremely high levels of trust. That is not a luxury afforded to everybody.

Or you should skip the 5 whys altogether and go for contributing factor analysis instead. In fact, this is the methodology I would recommend to most teams of medium size and up.

P.S. As this article was being finished, news finally came in that the industry we must learn from - aviation - is learning its lesson, as Boeing’s chief technical pilot has been indicted for fraud. I do not hold out much hope for higher Boeing brass facing actual accountability, but one can dream.