
Using Postmortems to Understand Service Reliability

2017 was a year of many major outages—some took down the Internet for hours while others disrupted business workflows and communication at companies large and small. Any way you slice it, these outages likely resulted in a lot of time devoted to postmortems.

I want to reflect a bit on why we write postmortems and suggest some things for authors to think about when writing them. I think there’s room for all of us to improve when it comes to gathering information to better plan proactive fixes before services catch fire.

Why Do We Conduct Postmortems?

Our incident response training docs put it this way: “Effective post-mortem[s] allow us to learn quickly from our mistakes and improve our services and processes for everyone.” The key takeaway for me is that organizations should use postmortems to capture what they learned from an incident. In other words:

  1. Postmortems are an exercise to learn the specifics of why an incident happened and what needs to be done to prevent it from recurring.
  2. Organizations should try to learn how effective their incident response process is and what areas can be improved.

I think these two points are what people generally mean when they talk about “Root Cause Analysis and Causal Factors,” “What Went Well,” and “What Didn’t Go Well” in postmortems.

That’s not what I want to talk about here though.

I think there’s another layer we get out of the postmortem process itself that hasn’t usually been part of the discussion: communicating about your service’s long-term stability.

For example, before one major incident, the postmortems for the minor incidents leading up to it in the same service highlighted nothing of concern. After the major incident was resolved, its postmortem examined the “Role of Previous Incidents” and found that all identified immediate and P1 follow-ups had been completed, or had been canceled because of changing plans or new information (it’s easy, and okay, to de-prioritize or skip something that looks like a one-off occurrence).

Between those minor incidents and the big one, there was certainly work going on around that particular platform, but I don’t think anyone would say the service was in good health! The postmortems written during this period focused on the immediate issues of each incident; they didn’t capture the health of the service as a whole. As humans, we’re bad at remembering things, so it’s important to look at broader trends to see whether an issue is recurring. I think there’s an opportunity to level up processes by devoting more attention here when writing a postmortem report.
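
To make “look at broader trends” concrete, here’s a minimal sketch of the kind of trend check a team could run over its incident history. The CSV file name and its columns (a “service” name and an ISO-dated “created_at”) are assumptions about a hypothetical export, not a real PagerDuty schema.

    from collections import Counter
    from csv import DictReader

    def incident_trend(path: str) -> Counter:
        """Tally incidents per (service, quarter) from an exported incident log."""
        counts: Counter = Counter()
        with open(path, newline="") as f:
            for row in DictReader(f):
                # "created_at" is assumed to start with an ISO date, e.g. 2017-06-14
                year, month, _ = row["created_at"].split("-", 2)
                quarter = (int(month) - 1) // 3 + 1
                counts[(row["service"], f"{year}-Q{quarter}")] += 1
        return counts

    if __name__ == "__main__":
        # Three or more incidents in a quarter is an arbitrary threshold;
        # tune it to your own incident volume.
        for (service, quarter), n in sorted(incident_trend("incidents.csv").items()):
            flag = "  <-- recurring?" if n >= 3 else ""
            print(f"{quarter}  {service}: {n} incident(s){flag}")

Even a crude tally like this surfaces the pattern that individual postmortems, each focused on its own incident, tend to miss.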

At PagerDuty, engineering teams own their services, so we have opinions about the ongoing stability of the services we run. When a major incident involves one of them, it forces us to re-examine our judgment of its stability and ask whether our opinion of its long-term health has changed because of the incident. If it has, we re-evaluate our plans to determine whether we need to prioritize large-scope work to improve that service. For a postmortem report, the crucial thing to remember is that the action items we choose not to take are as important to capture as the ones we decide to do.

Looking over postmortem action items, we found that they tend to be very fine-grained and tightly scoped: upgrade this library, add this monitor, and so on. The guidance that floats around about action-item timelines reinforces this. But it’s also important to communicate beyond that; large-scope remedial improvements that are spotted early are much easier to work into teams’ roadmaps. Engineering teams, being the people closest to their services, often have deep internal knowledge and good instincts about service health, but they don’t always have a good way to share that knowledge or to highlight issues that need larger work. Including this information in postmortem reports is an opportunity to be more transparent about these looming vulnerabilities.

The postmortem report is not just for the team that owns the service: that team prepares the report and conducts the investigation, but the final report itself is for the whole organization. A good report captures the risks in our current services and helps Product and Engineering prioritize work on them more proactively.

Five Questions to Answer During a Postmortem (None of Which Are “Why”)

Someone from outside your team should be able to read your postmortem report and answer these five questions:

  1. How did we view the health of the service involved prior to the incident?
  2. Did this incident teach us something that should change our views about this service’s health?
  3. Was this an isolated and specific bug—a failure in a class of problem we anticipated—or did it uncover a class of issue we did not architecturally anticipate in the service?
  4. Do we think an incident akin to this one will happen again if we don’t take larger systemic action beyond the action items captured here?
  5. Will this class of issue get worse, or become more likely, as we continue to grow and scale our use of the service?

Bonus question: Was there a previous incident that showed early signs pointing to this one?
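
Here’s a minimal sketch of how a team might capture the answers as a structured section of its report template. The class and field names are hypothetical, chosen for illustration; they’re not part of any PagerDuty product or standard.

    from dataclasses import dataclass, field
    from enum import Enum

    class FailureClass(Enum):
        # Question 3: was this failure in a class of problem we anticipated?
        ANTICIPATED = "a failure in a class of problem we anticipated"
        UNANTICIPATED = "a class of issue we did not architecturally anticipate"

    @dataclass
    class ServiceHealthAssessment:
        """One field per question above; names are illustrative only."""
        health_before_incident: str           # Q1: how we viewed the service's health
        health_view_changed: bool             # Q2: did the incident change that view?
        failure_class: FailureClass           # Q3: anticipated vs. unanticipated
        recurs_without_systemic_action: bool  # Q4: will it happen again without larger work?
        worsens_with_scale: bool              # Q5: does growth make it worse or more likely?
        earlier_warning_incidents: list[str] = field(default_factory=list)  # bonus question
        divergent_views: str = ""             # disagreement within the team is worth recording

Treating these answers as required fields, rather than free-form prose that may or may not appear, is one way to make sure the service-health discussion happens every time.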

I’d expect these answers usually to serve as introductory text for the “Action Items” the team intends to take, but sometimes “What Went Well” or “What Didn’t Go Well” will be the more appropriate home.

Additionally, if the team preparing the report holds divergent views on these questions, capture that too! Uncertainty is a valuable signal.

It’s also worth clarifying what we think we are accomplishing with the action items we are taking.

Ask yourselves, are we:

  1. Dealing with a specific issue immediately in a narrow, targeted way?
  2. Taking action to eliminate what we see as an entire class of potential issues?
  3. Not taking action, because larger efforts are already underway and will rapidly obsolete a targeted fix? (If so, those larger efforts should be called out!)
  4. Not taking significant action because we don’t think it’s justified?
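
These four postures could even be recorded as an explicit field in the report template sketched earlier. The enum below is illustrative, not a standard taxonomy:

    from enum import Enum

    class ActionPosture(Enum):
        """What we believe our action items accomplish, mirroring the four options above."""
        TARGETED_FIX = "dealing with a specific issue in a narrow, targeted way"
        ELIMINATE_CLASS = "eliminating an entire class of potential issues"
        DEFERRED_TO_LARGER_EFFORT = "no targeted fix; larger efforts underway will obsolete it"
        NO_ACTION_JUSTIFIED = "no significant action; we don't think it's justified"

Recording the posture explicitly makes the things we chose not to do visible to the rest of the organization, which is exactly the transparency argued for above.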

Learning more from and communicating better with postmortems will help you improve services and reduce the number and severity of incidents you encounter. We all want fewer major incidents and more sleep, and we can have that if we make sure we’re learning all we can from the incidents we do have.


Be sure to check out our Postmortem Handbook in which we share lessons learned from the trenches and how you can conduct better postmortems. Or dive directly into the product and try our streamlined postmortem process where you can create incident reports with a single click. Sign up for a free trial to get started!
