Getting the Most from Your Incident Post-Mortem

What do you do after you’ve experienced an incident and performed a post-mortem (or postmortem)? That may seem like a simple question, or even a non-question; after all, it’s easy to think of the post-mortem as the last step in handling an incident.

But it’s not. In many ways, what you do with an incident post-mortem can be as important as the post-mortem itself. Below, I explain why and offer tips on what to do after the post-mortem is complete.

Why Post-Mortems?

Before we take a closer look at that question, however, we need to answer an even more basic one: What is the function of a post-mortem, and what should it contain?

An incident post-mortem serves the following basic functions:

  1. It provides a record of the incident, its cause and related symptoms, its resolution, and its impact for future reference. This can be important both for a future understanding of the technical issues and for the resolution of legal or administrative concerns arising from the incident.
  2. It serves as a basis for analyzing and resolving the fundamental technical problems which gave rise to the incident.
  3. It provides a framework for understanding and improving the incident response process.

To support these basic functions, a post-mortem should include a record of the incident, the response, and its resolution. It should also include an analysis of the root cause of the incident, a description of the scope of the incident and its effects, and any appropriate recommendations for resolving the root problem, improving the response process, and/or mitigating the impacts of future incidents.
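
To make those contents concrete, here is a minimal sketch of how a team might capture these sections as a structured record, written in Python. The field names are illustrative assumptions, not a standard or PagerDuty-specific schema:

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List

    @dataclass
    class PostMortem:
        """Illustrative post-mortem record; field names are assumptions."""
        incident_id: str
        started_at: datetime           # when the incident began
        resolved_at: datetime          # when service was fully restored
        summary: str                   # what happened, in plain language
        root_cause: str                # analysis of the underlying cause
        impact_scope: str              # systems, users, and data affected
        response_timeline: List[str] = field(default_factory=list)  # who did what, and when
        recommendations: List[str] = field(default_factory=list)    # fixes and process improvements

Even a lightweight structure like this makes it harder to omit a required section when a post-mortem is written under time pressure.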

Understanding, But Not Blame

It is important to note that a post-mortem should not become a vehicle for blame, or for settling scores in corporate or organizational politics. If necessary, set up a separate process (e.g., an informal, moderated discussion within the department) for discussing personnel-related issues, as a way of channeling blame-setting away from the post-mortem itself.

The post-mortem should, however, include an honest discussion of any technical or organizational problems which may have contributed to the incident, or which became apparent during the response. The emphasis should be on improvements in the technology or the response process, rather than the deficiencies of individuals or teams, or of their work.

When is a Post-Mortem Necessary?

Not all incidents require a post-mortem. Minor operational issues, incidents with a well-understood cause and a simple resolution, and incidents which are easily contained with no downtime or loss of data may not need a post-mortem.

Here are a few examples of situations for which a post-mortem is necessary (a code sketch of these criteria follows the list):

  • The incident resulted in the loss of data, productivity, or customer access
  • The incident required shutdown, re-routing, rollback to an earlier software version, and/or prolonged action for resolution
  • The incident was not detected or handled properly by the appropriate monitoring or alerting systems
  • The root cause appears to be unknown, unexpected, or suspicious in nature
  • The issue appears to involve underlying elements of application architecture or technology which may have wide-ranging effects on the operation of the system
  • There were serious problems or inadequacies in the response or resolution process
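
As a rough illustration, the criteria above could be encoded as a simple triage check. This is a sketch in Python; the attribute names are assumptions, and a real system would derive them from monitoring and ticket data:

    def needs_postmortem(incident) -> bool:
        """Return True if any of the listed criteria applies to the incident."""
        return any([
            incident.lost_data_productivity_or_access,  # loss of data, productivity, or customer access
            incident.required_prolonged_action,         # shutdown, re-routing, rollback, etc.
            incident.missed_by_monitoring,              # detection or alerting failed
            incident.cause_unknown_or_suspicious,       # unknown, unexpected, or suspicious root cause
            incident.touches_core_architecture,         # wide-ranging architectural implications
            incident.response_had_serious_problems,     # serious problems in the response or resolution
        ])

Encoding even this much of the decision saves teams from relitigating, incident by incident, whether a post-mortem is worth writing.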

Post-Mortems Exist to Facilitate Learning

For a post-mortem to be of value, it needs to be read and understood by the people who are responsible for analyzing, resolving, and preventing the long-term problems it describes.

This may mean, for example, that teams or departments with a stake in the problem or its resolution should be required to read the post-mortem and engage in a discussion as soon as possible to determine appropriate next steps. The actual process for circulating post-mortems and ensuring that they are read and lead to action items will, of course, depend on the structure and the managerial philosophy of your organization.

Basic Components of a Post-Mortem

There are three key areas to look at when writing or reading an incident post-mortem:

Root Cause

A post-mortem should always contain a description of the root cause, even if it is known and trivial. If it is non-trivial, the description should include an analysis of the cause, with, if possible, a precise identification of the actual root of the problem and a determination of whether it needs to be fixed. If the specific root cause cannot be precisely identified, any information which may lead to its future identification should be included.

If, for example, during the course of the incident’s resolution, it becomes apparent that the problem originated in a module which contains a large amount of legacy code, it is important to include that fact in the root cause analysis, even if it is not possible at the time of the post-mortem to identify the root cause below the level of the module itself. The mere fact of identifying legacy code in connection with an incident may be of value not only in the resolution of the incident but also in later surveys identifying code which needs to be replaced.

Response

The post-mortem should include a full technical description of the response process. It should also include a description and analysis of the relative success or failure of that process. This should be done without pointing the finger of blame at anyone, but it should clearly indicate any apparent failures or weaknesses in the response process, or in the way that the response was carried out. These can include problems with the division of responsibilities among response team members, with communication within the response team or between the team and other stakeholders across the business, and with specific response procedures.

Failures of the response process can range from technical to organizational. They can include such simple things as failing to tell affected departments or users that a system or application was unavailable while the problem was being resolved. If two team members performed the same task without coordinating with each other, or nobody performed a required task, leading to a delay in the resolution, that should be noted in the post-mortem as an indication of potential problems in team organization or communication.

Damage Scope and Control

The post-mortem should include a clear and accurate description of the extent of any damage caused by the incident, including loss of data, loss of productivity, and interruptions in user access. It is equally important to include a description and analysis of any actions taken to limit or remedy this damage. Damage control should be considered a separate process from technical incident resolution. Depending on the type of incident, the type of damage, and the organization’s structure, it may be a customer service responsibility or require action items for other departments in the business.

Damage control actions should be part of the post-mortem, since they may directly or indirectly affect how similar incidents are handled in the future. If, for example, an outage results in the shutdown of an airline flight reservation system, it may be necessary to give priority to putting in place an alternate system for handling reservations during downtime.

Not Embarrassment, But Gold

The key to getting the most out of post-mortems lies in understanding that they are roadmaps for improvement of your application, your infrastructure, and your response process. Each post-mortem has the potential to improve the way that your system operates and the way that you handle incidents. Rather than treating post-mortems as an embarrassment or an indication of some kind of failure, you should treat this valuable opportunity to reflect as gold.


PagerDuty offers a completely free post-mortem handbook that shares industry best practices and includes a post-mortem template. Use it to help you formalize your own post-mortem process to make it as easy as possible for your team to respond to issues. Even better, post-mortems are part of the PagerDuty platform — sign up for a free 14-day trial and streamline the entire post-mortem process with automated timeline building, collaborative editing, actionable insights, and more!
