The software post-mortem: what, when, how, and why

Typically, a ‘post-mortem’ is a phrase you’ll hear in a medical context. It refers to an examination or investigation that takes place to determine the cause of a death.

But post-mortems also happen outside of the hospital. Indeed, the software post-mortem is a useful process for developers as they write and review their code.

Let’s take a closer look.

What is a software post-mortem?

A software post-mortem is a process that software development teams use to identify two key things:

  1. The cause of a process failure in the software
  2. How to prevent recurrence of the failure

A software post-mortem is different from a retrospective, in which both positive and negative points are reviewed. In a post-mortem, only the failure or incident is reviewed.

Importantly, the point of this analysis is not to place blame on a person or group. Rather, it’s to find and fix errors, and learn from them so mistakes aren’t repeated.

Creating a post-mortem report is a collaborative process. The whole development team takes part, as it allows everyone to reflect on and learn from process errors.

When caSaaStrophe strikes…

Post-mortem means “after death”. In software, however, it’s more like “after failure.” So, you should conduct a software post-mortem after every major incident with your software.

An incident is a major bug, a period of downtime, or any other failure that directly impacts users, customers, or stakeholders. Essentially, you conduct a post-mortem following a business-impairing caSaaStrophe.

Basically, any time something goes severely wrong with your software, you conduct a post-mortem after fixing it.

How to create a post-mortem report

A software post-mortem is a discussion and analysis of the (averted) disaster. To create one, you must analyse the event in its entirety.

Your report will include:

1. What happened

First, outline the details of the incident you’re analysing. This will provide context and a starting point for the rest of the post-mortem, as well as a handy record of the incident.  

Start with what failed: what functionality was affected? Did the software go down completely? Be sure to outline who the failure impacted, and how many people were affected.

Next, detail the steps that were taken, and how long the issue lasted. This would include who noticed the issue and who was involved in resolving it.

Finally, outline the actions taken to fix the problem, including those that failed and those that were successful.

2. An analysis of the root causes of the incident

With the event recorded, you can start to dive into the details. This starts with diagnosing the root cause of the incident. That is, what broke, which processes failed, and why.

For example, was there an issue with the code? With an update? With an integration? Was there missing or incorrect data input? Was there a cyberattack?

And so on.

3. How the problem was diagnosed and managed

Part of understanding a problem is understanding how you fixed it. So, be sure to analyse in greater detail the steps you took to diagnose and resolve the issue as it happened.

For example:

  • How was the problem discovered?
  • What fixes didn’t work, and why?
  • Why did the final resolution succeed?

This will also help teams to reflect on any weaknesses in monitoring or crisis management. Could the problem have been noticed or fixed sooner or in a better way?

4. What was learned through the experience and process

The next step of the software post-mortem is the bit that brings the benefits to light. That is, highlighting exactly what has been learned from the analysis and the experience as a whole.

For instance, you may have noticed some code weaknesses that need to be fixed. Perhaps the failure has pointed to a gap in your cybersecurity.

You should also analyse your response to the failure. What worked well? Where could you improve your crisis management skills?

5. Strategy to prevent failure recurrence

The final part of a software post-mortem is to decide and outline the strategies and steps you’ll take moving forward. What will happen, and what do the team need to do, to prevent the failure from recurring?

This might include extra code reviews to look for code inefficiencies. Or it could mean putting new cybersecurity policies in place.
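The five report sections above can be sketched as a simple data structure. This is a minimal illustration rather than a standard format: the `PostMortem` class, its field names, and the sample incident (including its dates and times) are all invented for the example. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PostMortem:
    """Hypothetical container mirroring the five report sections."""
    title: str
    started_at: datetime    # when the failure began
    detected_at: datetime   # when someone noticed it
    resolved_at: datetime   # when the fix was confirmed
    what_happened: str = ""
    root_causes: list[str] = field(default_factory=list)
    diagnosis_and_response: list[str] = field(default_factory=list)
    lessons_learned: list[str] = field(default_factory=list)
    prevention_actions: list[str] = field(default_factory=list)

    def time_to_detect(self) -> timedelta:
        """How long the failure went unnoticed."""
        return self.detected_at - self.started_at

    def duration(self) -> timedelta:
        """How long the issue lasted, from start to resolution."""
        return self.resolved_at - self.started_at

    def missing_sections(self) -> list[str]:
        """Names of report sections that are still empty."""
        sections = {
            "What happened": self.what_happened,
            "Root causes": self.root_causes,
            "Diagnosis and management": self.diagnosis_and_response,
            "What was learned": self.lessons_learned,
            "Prevention strategy": self.prevention_actions,
        }
        return [name for name, content in sections.items() if not content]


# Made-up incident, used purely for illustration.
report = PostMortem(
    title="Checkout outage (invented example)",
    started_at=datetime(2024, 3, 1, 9, 0),
    detected_at=datetime(2024, 3, 1, 9, 20),
    resolved_at=datetime(2024, 3, 1, 10, 30),
    what_happened="Checkout returned errors for all users.",
    root_causes=["A configuration change was deployed without review."],
    diagnosis_and_response=["Rolled back the configuration change."],
)

print(report.time_to_detect())    # 0:20:00
print(report.duration())          # 1:30:00
print(report.missing_sections())  # ['What was learned', 'Prevention strategy']
```

A check like `missing_sections()` is one way a team might confirm that a report covers every section before it is circulated. The timestamps also make it easier to discuss whether the problem could have been noticed or fixed sooner.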

Why conduct a software post-mortem?

Conducting a software post-mortem is not about assigning blame. Rather, it’s about finding failed processes. (Remember, failing processes, not failing people.) But why is that useful if you’ve already fixed the problem?

The point of a post-mortem is to prevent problems from happening again. It’s a way to make sure developers don’t repeat, in future projects, the mistake that led to the failure.

By conducting this kind of analysis, then, you create and offer better software. That is, software that’s more robust, more stable, and less prone to major bugs.

In short, it improves your software and your skills. It allows developers to learn from experience and avoid repeating past mistakes.

TL;DR: The software post-mortem

A software post-mortem is an analysis conducted after a system failure. The goal is to understand why an incident or error happened, and to learn from the experience. In so doing, future software becomes more robust.

Problems happen. But by conducting a post-problem analysis, you can make the same problems far less likely to happen again.
