Getting the Most from Your Incident Post-Mortem

What do you do after you’ve experienced an incident and performed a post-mortem (or postmortem)? That may seem like a simple question, or even a non-question; after all, it’s easy to think of the post-mortem as the last step in handling an incident.

But it’s not. In many ways, what you do with an incident post-mortem can be as important as the post-mortem itself. Below, I explain why and offer tips on what to do after the post-mortem is complete.

Why Post-Mortems?

Before we take a closer look at that question, however, we need to answer an even more basic one: What is the function of a post-mortem, and what should it contain?

An incident post-mortem serves the following basic functions:

  1. It provides a record of the incident, its cause and related symptoms, its resolution, and its impact for future reference. This can be important both for future understanding of the technical issues and for resolving legal or administrative concerns arising from the incident.
  2. It serves as a basis for analyzing and resolving the fundamental technical problems which gave rise to the incident.
  3. It provides a framework for understanding and improving the incident response process.

To support these basic functions, a post-mortem should include a record of the incident, the response, and its resolution. It should also include an analysis of the root cause of the incident, a description of the scope of the incident and its effects, and any appropriate recommendations for resolving the root problem, improving the response process, and/or mitigating the impacts of future incidents.
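As a rough illustration of those contents, a post-mortem record might be structured along the following lines. This is a minimal sketch in Python; the field names are hypothetical illustrations, not a prescribed schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PostMortem:
    """Illustrative post-mortem record; field names are hypothetical."""
    incident_id: str
    summary: str                 # what happened, in a sentence or two
    timeline: List[str]          # key events: detection, escalation, resolution
    root_cause: str              # analysis of the underlying cause
    scope_and_impact: str        # affected systems, users, data, downtime
    response_notes: str          # what worked and what did not in the response
    recommendations: List[str] = field(default_factory=list)  # follow-up actions
```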

Understanding, But Not Blame

It is important to note that a post-mortem should not become a vehicle for blame, or for settling scores in corporate or organizational politics. If necessary, set up a separate process (e.g., an informal, moderated discussion within the department) for handling personnel-related issues, as a way of channeling blame-setting away from the post-mortem itself.

The post-mortem should, however, include an honest discussion of any technical or organizational problems which may have contributed to the incident, or which became apparent during the response. The emphasis should be on improvements in the technology or the response process, rather than the deficiencies of individuals or teams, or of their work.

When is a Post-Mortem Necessary?

Not all incidents require a post-mortem. Minor operational issues, incidents with a well-understood cause and a simple resolution, and incidents which are easily contained with no downtime or loss of data may not need a post-mortem.

Here are a few examples of situations for which a post-mortem is necessary (a minimal triage sketch follows the list):

  • The incident resulted in the loss of data, productivity, or customer access
  • The incident required shutdown, re-routing, rollback to an earlier software version, and/or prolonged action to resolve
  • The incident was not detected or handled properly by the appropriate monitoring or alerting systems
  • The root cause appears to be unknown, unexpected, or suspicious in nature
  • The issue appears to involve underlying elements of application architecture or technology which may have wide-ranging effects on the operation of the system
  • There were serious problems or inadequacies in the response or resolution process
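To make that triage concrete, here is a minimal sketch of how these criteria might be encoded as a checklist. The `Incident` attributes are hypothetical flags mirroring the list above, not fields from any real incident-management tool:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    """Hypothetical triage flags mirroring the criteria listed above."""
    lost_data_productivity_or_access: bool = False
    required_shutdown_or_rollback: bool = False
    missed_by_monitoring_or_alerting: bool = False
    root_cause_unknown_or_suspicious: bool = False
    touches_underlying_architecture: bool = False
    response_had_serious_problems: bool = False

def needs_postmortem(incident: Incident) -> bool:
    """A post-mortem is warranted if any single criterion applies."""
    return any(vars(incident).values())

# Example: an incident that monitoring missed warrants a post-mortem.
assert needs_postmortem(Incident(missed_by_monitoring_or_alerting=True))
```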

Post-Mortems Exist to Facilitate Learning

For a post-mortem to be of value, it needs to be read and understood by the people responsible for analyzing, resolving, and preventing the long-term problems it describes.

This may mean, for example, that teams or departments with a stake in the problem or its resolution should be required to read the post-mortem and engage in a discussion as soon as possible to determine appropriate next steps. The actual process for circulating post-mortems and ensuring that they are read and lead to action items will, of course, depend on the structure and managerial philosophy of your organization.

Basic Components of a Post-Mortem

There are three key areas to look at when writing or reading an incident post-mortem:

Root Cause

A post-mortem should always contain a description of the root cause, even if it is known and trivial. If it is non-trivial, the description should include an analysis of the cause, identifying the actual root of the problem as precisely as possible and noting whether it needs to be fixed. If the specific root cause cannot be precisely identified, any information which may lead to its future identification should be included.

If, for example, during the course of the incident’s resolution, it becomes apparent that the problem originated in a module which contains a large amount of legacy code, it is important to include that fact in the root cause analysis, even if it is not possible at the time of the post-mortem to identify the root cause below the level of the module itself. The mere fact of identifying legacy code in connection with an incident may be of value not only in the resolution of the incident but also in later surveys identifying code which needs to be replaced.
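As a usage sketch of that idea, even a partially identified root cause is worth recording in a structured way. The module name, notes, and follow-up items below are hypothetical:

```python
# Hypothetical root-cause entry for an incident traced only to a legacy module.
root_cause_entry = {
    "identified": False,  # precise root cause not yet known
    "localized_to": "billing/legacy_rates",  # hypothetical module name
    "notes": "Fault reproduced only under peak load; module is largely legacy code.",
    "follow_up": [
        "profile the module under load",
        "flag the module in the next legacy-code survey",
    ],
}
```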

Response

The post-mortem should include a full technical description of the response process, along with a description and analysis of the relative success or failure of that process. This should be done without pointing the finger of blame at anyone, but it should clearly indicate any apparent failures or weaknesses in the response process, or in the way the response was carried out. These can include the division of responsibilities among response team members, communication within the response team or between the team and other stakeholders across the business, and problems with specific response procedures.

Failures of the response process can be technical or organizational. They can include such simple things as failing to tell affected departments or users that a system or application was unavailable while the problem was being resolved. If two team members performed the same task without coordinating, or nobody performed a required task and resolution was delayed as a result, that should be noted in the post-mortem as an indication of potential problems in team organization or communication.

Damage Scope and Control

The post-mortem should include a clear and accurate description of the extent of any damage caused by the incident, including loss of data, loss of productivity, and interruptions in user access. It is equally important to include a description and analysis of any actions taken to limit or remedy this damage. Damage control should be considered a separate process from technical incident resolution; depending on the type of incident, the type of damage, and the organization’s structure, it may be a customer service responsibility or require action items for other departments in the business.

Damage control actions should be part of the post-mortem, since they may directly or indirectly affect how similar incidents are handled in the future. If, for example, an outage results in the shutdown of an airline flight reservation system, it may be necessary to prioritize putting an alternate system in place for handling reservations during downtime.

Not Embarrassment, But Gold

The key to getting the most out of post-mortems lies in understanding that they are roadmaps for improvement of your application, your infrastructure, and your response process. Each post-mortem has the potential to improve the way your system operates and the way you handle incidents. Rather than treating post-mortems as an embarrassment or an indication of failure, you should treat this valuable opportunity to reflect as gold.


PagerDuty offers a completely free post-mortem handbook that shares industry best practices and includes a post-mortem template. Use it to help you formalize your own post-mortem process and make it as easy as possible for your team to respond to issues. Even better, post-mortems are part of the PagerDuty platform. Sign up for a free 14-day trial and streamline the entire post-mortem process with automated timeline building, collaborative editing, actionable insights, and more!

