Avoiding Incident Response Bottlenecks

Incident response bottlenecks – you know they’re real and you know that your incident response system probably has a few, but they must be minimized as they hurt your on-call teams and your customers. Let’s take a look at some of the most critical bottlenecks and how to avoid them.

What Are Your Goals?

First, before you understand the bottlenecks in any process, you have to understand what the goals of that process are. What are the goals of incident response?

For most incident response teams, the basic list of goals would probably look something like this:

  • To prevent incidents from occurring. Prevention at this level may be largely out of the hands of incident management, which is generally focused on resolving issues, but prevention is critical for reducing unplanned work.
  • To keep the damage confined to the smallest scope possible. In practice, this is where most of the preventive effort in incident management is focused. If you can’t keep incidents from happening, you can keep them from spreading.
  • To resolve incidents quickly. Not all incidents get resolved and not all apparent fixes genuinely resolve the underlying problems, but incident resolution remains the bottom line.

Watch Out For These Bottlenecks

If the above are the basic goals of incident response, then the bottlenecks are likely to be conditions which make it difficult to meet those goals. The most important of these are:

Failure to adequately set priorities.

Prioritization is the most important tool available for both incident resolution and confining incident impact. It’s what allows you to focus on the incidents that are most in need of attention based on their potential for serious impact. It allows you to set aside incidents which are relatively minor in their impact, but can take up a great deal of incident response team time and attention. When you fail to set adequate priorities, you almost guarantee that some major incidents will not be handled promptly, or possibly not at all.
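As a rough illustration of what setting priorities can look like in practice, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Incident` fields, the threshold of 100 affected customers, and the three priority levels are hypothetical, and a real team would choose its own criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    title: str
    customers_affected: int  # hypothetical impact measure
    service_critical: bool   # hypothetical flag: does it touch a core service?


def priority(incident: Incident) -> int:
    """Return 1 (highest) to 3 (lowest) using illustrative thresholds."""
    wide_impact = incident.customers_affected >= 100  # assumed cutoff
    if incident.service_critical and wide_impact:
        return 1
    if incident.service_critical or wide_impact:
        return 2
    return 3


def triage(incidents: list[Incident]) -> list[Incident]:
    """Order incidents so the highest-priority ones are handled first.

    Ties are broken by the number of customers affected, largest first.
    """
    return sorted(incidents, key=lambda i: (priority(i), -i.customers_affected))
```

The point of the sketch is only that the ordering is decided by explicit rules rather than by whichever alert arrived most recently. Minor incidents still end up in the queue, just behind the ones with more potential for serious impact.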

Alert fatigue and incident overload.

If your response team is overwhelmed by the volume of alerts, they may become effectively paralyzed, and unable to respond at all, simply because they don’t have the time to recognize which issues should have top priority, or to separate genuine incidents from alert noise. Eventually, this can lead to chronic alert fatigue, as team members develop the unconscious mental habit of blocking out most alerts, so that they’re able to concentrate on at least a few of them.

A (typically automated) system for filtering out alert noise is an absolute necessity. Non-actionable alerts should be suppressed, and related alerts should be grouped into a single incident. Ideally, all of this should be done automatically via rules. It’s also crucial to route alerts to the correct teams or team members rather than broadcasting them to everyone, as the resulting alert fatigue and lack of accountability can quickly cripple incident response.
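The three rules described above (suppress, group, route) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular alerting product works: the `Alert` fields, the `ROUTES` table, the team names, and the choice to group by service are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    actionable: bool  # hypothetical flag set by an upstream rule


# Hypothetical routing table: service name -> owning on-call team.
ROUTES = {"payments": "payments-oncall", "search": "search-oncall"}
DEFAULT_TEAM = "platform-oncall"  # assumed fallback for unmapped services


def build_incidents(alerts):
    """Suppress non-actionable alerts, group the rest by service,
    and route each group to a single owning team."""
    grouped = defaultdict(list)
    for alert in alerts:
        if not alert.actionable:
            continue  # suppression: noise never reaches a responder
        grouped[alert.service].append(alert)  # grouping: one incident per service
    return [
        {"team": ROUTES.get(service, DEFAULT_TEAM), "service": service, "alerts": items}
        for service, items in grouped.items()
    ]
```

With this shape, two actionable alerts from the same service become one incident for one team, and an informational alert produces no page at all.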

Inadequate preparation, training, or experience.

Ideally, every incident response team should consist of highly trained and experienced technicians who are able to diagnose problems quickly and who understand which tools and techniques they should use to fix each incident.

In practice, of course, it isn’t so simple. High turnover and the need for more responders may result in response teams in which most or even all of the members have little or no experience. When this happens, considerable time may be lost as new team members learn things that experienced responders already know. When there is a complete break in continuity (an entirely new team), the situation may be made much worse, because the old team’s knowledge is now “lost wisdom” and often unrecoverable.

The best ways to minimize such problems are to have a formal system of training for incident responders, to place new team members in teams with experienced responders whenever possible, and to have adequate documentation available to response teams. To ensure consistent, repeatable best practices, that documentation should include a basic manual of procedures, such as a runbook, and a well-indexed, easily searchable, cross-referenced database of past incidents.
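To make the idea of a searchable record of past incidents concrete, here is a minimal Python sketch of a keyword index. It assumes incidents are stored as short text summaries keyed by an ID. The ID format, the function names, and the rule that every query word must match are all choices made for this example, and a real team would more likely rely on an existing search tool.

```python
import re
from collections import defaultdict


def _words(text: str) -> list[str]:
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(incidents: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the IDs of the incident summaries that contain it."""
    index = defaultdict(set)
    for incident_id, summary in incidents.items():
        for word in _words(summary):
            index[word].add(incident_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the IDs of incidents whose summaries contain every query word."""
    words = _words(query)
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)
```

Even something this small lets a new responder ask "have we seen this before?" and get back the incident IDs worth reading, which is the kind of continuity that is otherwise lost when an experienced team moves on.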

Inadequate preparation for a major new rollout.

A new release of a major application is coming up, or maybe it’s an entirely new application or service. Is your response team ready to handle the volume of alerts if it turns out that the developers missed a few serious bugs? Murphy’s Law is, after all, waiting around every corner. All it takes is an update to a widely used program and one or two bugs of the kind that produce a cascade of not-so-easy-to-trace errors. If your response team isn’t prepared, you may find that all of your time and resources are taken up by a storm of high-priority alerts, leaving you with very few reserves for handling any unrelated incidents that may come up.

Ideally, of course, the update will be adequately tested before full release, with some kind of limited A/B or canary type of deployment. As long as the response team is part of this deployment, they will have the opportunity to deal with problems that do arise on a much smaller scale. The decision to start with a limited deployment, however, is likely to be out of the hands of the incident-response team, and they may have to deal with an inadequately tested release going directly to full deployment.

When this happens, it may be necessary to place all responders on call, or to designate a special team to handle all update-related problems, freeing up at least some responders to handle the unrelated, unplanned issues that must also be addressed. Which approach works best depends at least in part on the scope of the update and the response team resources that are available. Either way, plans can be iterated on as needed, and having some plan in place is far better than being completely unprepared.
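The "special team" option can be expressed as a simple routing rule: alerts from services touched by the release go to a dedicated queue, and everything else stays with the regular on-call rotation. The Python sketch below is only an illustration. The service names, the queue names, and the idea of keying the rule on a fixed set of services are assumptions made for the example.

```python
# Hypothetical: the services changed by the upcoming release.
ROLLOUT_SERVICES = frozenset({"checkout", "cart"})


def route(service: str, rollout_services=ROLLOUT_SERVICES) -> str:
    """Pick a queue for an alert based on whether its service was part of the rollout."""
    return "rollout-team" if service in rollout_services else "general-oncall"


def split_queue(alert_services, rollout_services=ROLLOUT_SERVICES):
    """Split a stream of alerting service names into the two queues, preserving order."""
    queues = {"rollout-team": [], "general-oncall": []}
    for service in alert_services:
        queues[route(service, rollout_services)].append(service)
    return queues
```

A rule like this is easy to switch on for the rollout window and off again afterwards, which fits the point that the plan can be adjusted as the release settles.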

Clearing Things Up

There are plenty of other bottlenecks, of course, including those that arise from outdated, failure-prone, or overloaded infrastructure, as well as the kind that result from response team time being co-opted for non-incident-related tasks. But the bottlenecks that we’ve listed account for much of the time lost by incident response teams, and the remediation approaches that we’ve suggested help to clear up most of them.



The post Avoiding Incident Response Bottlenecks appeared first on PagerDuty.

