
Avoiding Incident Response Bottlenecks

Incident response bottlenecks – you know they’re real and you know that your incident response system probably has a few, but they must be minimized as they hurt your on-call teams and your customers. Let’s take a look at some of the most critical bottlenecks and how to avoid them.

What Are Your Goals?

First, before you understand the bottlenecks in any process, you have to understand what the goals of that process are. What are the goals of incident response?

For most incident response teams, the basic list of goals would probably look something like this:

  • To prevent incidents from occurring. Prevention at this level may be largely out of the hands of incident management, which is generally focused on resolving issues, but prevention is critical for reducing unplanned work.
  • To keep the damage confined to the smallest scope possible. In practice, this is where most of the preventive effort in incident management is focused. If you can’t keep incidents from happening, you can keep them from spreading.
  • To resolve incidents quickly. Not all incidents get resolved and not all apparent fixes genuinely resolve the underlying problems, but incident resolution remains the bottom line.

Watch Out For These Bottlenecks

If the above are the basic goals of incident response, then the bottlenecks are likely to be conditions which make it difficult to meet those goals. The most important of these are:

Failure to adequately set priorities.

Prioritization is the most important tool available for both incident resolution and confining incident impact. It’s what allows you to focus on the incidents that are most in need of attention based on their potential for serious impact. It also allows you to set aside incidents which are relatively minor in their impact, but which can take up a great deal of incident response team time and attention. When you fail to set adequate priorities, you almost guarantee that some major incidents will not be handled promptly, or possibly not at all.
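To make the idea concrete, here is a minimal sketch of impact-based prioritization. Everything in it is an assumption for illustration: the `Incident` fields, the `SEVERITY_WEIGHT` values, and the scoring formula are hypothetical, not a PagerDuty feature or a recommended standard. The point is only that priority should be computed from potential impact rather than from arrival order.

```python
from dataclasses import dataclass

# Hypothetical severity weights; tune these to your own services.
SEVERITY_WEIGHT = {"critical": 100, "error": 50, "warning": 10, "info": 1}


@dataclass
class Incident:
    title: str
    severity: str          # one of the SEVERITY_WEIGHT keys
    customer_facing: bool  # does the affected service serve customers directly?
    affected_users: int    # rough estimate of users impacted


def priority_score(incident: Incident) -> int:
    """Return a score where a higher value means 'handle this sooner'."""
    score = SEVERITY_WEIGHT.get(incident.severity, 0)
    if incident.customer_facing:
        score *= 2
    # Cap the user contribution so a rough estimate cannot swamp severity.
    score += min(incident.affected_users, 1000) // 10
    return score


def triage(incidents: list[Incident]) -> list[Incident]:
    """Return incidents ordered from most to least urgent."""
    return sorted(incidents, key=priority_score, reverse=True)
```

With a scheme like this, a customer-facing outage is placed ahead of a low-impact warning no matter which one arrived first.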

Alert fatigue and incident overload.

If your response team is overwhelmed by the volume of alerts, they may become effectively paralyzed, and unable to respond at all, simply because they don’t have the time to recognize which issues should have top priority, or to separate genuine incidents from alert noise. Eventually, this can lead to chronic alert fatigue, as team members develop the unconscious mental habit of blocking out most alerts, so that they’re able to concentrate on at least a few of them.

A (typically automated) system for filtering out alert noise is an absolute necessity. Non-actionable alerts should be suppressed, and related alert context should be grouped into a single incident. Ideally, all this should be done automatically via rules. Additionally, it’s crucial to implement a system for channeling alerts to the correct teams or team members, rather than broadcasting them to all teams and all members, as repeated alert fatigue and lack of accountability can also quickly become fatal.
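The three rules described above (suppress, group, route) can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular product implements it: the alert dictionaries, the `ROUTES` table, the team names, and the choice to treat `info` alerts as non-actionable are all hypothetical.

```python
from collections import defaultdict

# Hypothetical routing table: service name -> owning on-call team.
ROUTES = {"checkout": "payments-oncall", "search": "search-oncall"}
DEFAULT_TEAM = "platform-oncall"


def is_actionable(alert: dict) -> bool:
    """Suppress alerts nobody can act on (assumed here: severity 'info')."""
    return alert.get("severity") != "info"


def group_alerts(alerts: list[dict]) -> dict[tuple[str, str], list[dict]]:
    """Group actionable alerts into incidents keyed by (service, check)."""
    incidents = defaultdict(list)
    for alert in alerts:
        if is_actionable(alert):
            incidents[(alert["service"], alert["check"])].append(alert)
    return dict(incidents)


def route(incidents: dict) -> dict[str, list[tuple[str, str]]]:
    """Map each owning team to the incident keys it should be notified about."""
    by_team = defaultdict(list)
    for key in incidents:
        service = key[0]
        by_team[ROUTES.get(service, DEFAULT_TEAM)].append(key)
    return dict(by_team)
```

Two identical latency alerts for the same service collapse into one incident, informational alerts never reach anyone, and each incident goes to one owning team instead of being broadcast to everyone.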

Inadequate preparation, training, or experience.

Ideally, every incident response team should consist of highly-trained and experienced technicians who are able to diagnose problems quickly and who understand which tools and techniques they should use to fix each incident. 

In practice, of course, it isn’t so simple. High turnover and the need for more responders may result in response teams in which most or even all of the members have little or no experience. When this happens, considerable time may be lost as new team members learn things that experienced responders already know. When there is a complete break in continuity (an entirely new team), the situation may be made much worse, because the old team’s knowledge is now “lost wisdom” and often unrecoverable.

The best ways to minimize such problems are to have a formal system of training for incident responders, to place new team members in teams with experienced responders whenever possible, and to have adequate documentation available to response teams. Documentation to ensure consistent, repeatable best practices should include some kind of basic manual of procedures, such as a runbook, and a well-indexed, easily searchable, cross-referenced database of past incidents.
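A searchable record of past incidents does not have to be elaborate to be useful. The sketch below builds a simple keyword index over incident summaries. It assumes incidents are stored as an ID-to-summary mapping, and the function names are hypothetical. A real team would more likely rely on the search built into their ticketing or wiki tool.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(incidents: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the IDs of the incidents whose summary contains it."""
    index = defaultdict(set)
    for incident_id, summary in incidents.items():
        for word in tokenize(summary):
            index[word].add(incident_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the IDs of incidents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))
```

A new responder facing a checkout cache problem could then search for "cache checkout" and find the earlier incidents that involved both.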

Inadequate preparation for a major new rollout.

A new release of a major application is coming up, or maybe it’s an entirely new application or service. Is your response team ready to handle the volume of alerts if it turns out that the developers missed a few serious bugs? Murphy’s Law is, after all, waiting around every corner. All it takes is an update to a widely used program and one or two bugs of the kind that produce a cascade of not-so-easy-to-trace errors. If your response team isn’t prepared, you may find that all of your time and resources are taken up by a storm of high-priority alerts, leaving you with very few reserves for handling any unrelated incidents that may come up.

Ideally, of course, the update will be adequately tested before full release, with some kind of limited A/B or canary type of deployment. As long as the response team is part of this deployment, they will have the opportunity to deal with problems that do arise on a much smaller scale. The decision to start with a limited deployment, however, is likely to be out of the hands of the incident-response team, and they may have to deal with an inadequately tested release going directly to full deployment.
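When a canary deployment is available, the decision to widen the rollout can be tied to a simple health check. The following is a rough sketch only: the function name, the `max_ratio` and `min_requests` thresholds, and the 0.1% error-rate floor are assumptions chosen for illustration, not recommended values.

```python
def canary_is_healthy(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_ratio: float = 1.5,   # assumed tolerance relative to the baseline
    min_requests: int = 100,  # assumed minimum traffic before judging
) -> bool:
    """Return True if the canary's error rate is acceptably close to baseline."""
    if canary_requests < min_requests:
        return False  # not enough traffic yet to judge the canary
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / canary_requests
    # The small absolute floor keeps a zero-error baseline from failing
    # a canary over a single stray error.
    return canary_rate <= max(baseline_rate * max_ratio, 0.001)
```

If the check fails, the release stays on the small canary group, and the response team deals with the problem at that smaller scale.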

When this happens, it may be necessary to place all responders on-call — or to designate a special team to handle all update-related problems, freeing up at least some responders to handle unrelated, unplanned issues that also must be addressed. Which approach works best depends at least in part on the scope of the update, and the response team resources that are available. However, plans can always be iterated on as needed and having some plan in place will make a significant difference in comparison to being completely unprepared.

Clearing Things Up

There are plenty of other bottlenecks, of course, including those that arise from outdated, failure-prone, or overloaded infrastructure, as well as the kind that result from response team time being co-opted for non-incident-related tasks. But the bottlenecks that we’ve listed account for much of the time lost by incident response teams, and the remediation approaches that we’ve suggested help to clear up most of them.




The post Avoiding Incident Response Bottlenecks appeared first on PagerDuty.


More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and Github.
