
Determining Incident Priority


Alerts. It’s so easy for them to pile up. One moment, you’re looking at a handful of alerts. A few hours — or maybe even minutes — later, you’re looking at a mountain. How do you manage them and keep your responders from being completely overwhelmed?

These are hugely important questions. If your alert management system is flooded with noise and your response teams are in a permanent state of alert fatigue, you might as well not have an IT alert management system at all. Excessive noise and alert fatigue undermine the effectiveness of the entire system.

Apply Filtering: Alerts to Incidents

In many ways, the key to streamlining your alert management system lies in a rapid and accurate method for consolidating related alerts into incidents and determining incident priority. Sorting incidents by urgency filters out most noise automatically and gives you a reasonable approximation of what needs immediate attention and what can wait. Also keep in mind that not every alert needs an incident or a response: suppressing non-actionable alerts further cuts down the noise and lets you focus on what matters.

You will probably be able to automate at least part of the sorting process (for example, by source and keywords), although some of it, perhaps a considerable amount, will likely require monitoring and intervention by response team members operating in the dispatcher role. Whatever method you use, the basic criteria will remain the same.
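As a rough illustration of sorting by source and keywords, the sketch below groups alerts into candidate incidents and drops non-actionable ones. The `Alert` type, the keyword lists, and the function names are all hypothetical and chosen for this example; a real system would tune the rules to its own alert sources.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str   # e.g. the host or monitoring tool that raised the alert
    message: str


# Hypothetical keyword lists; these are assumptions, not a recommended set.
SUPPRESS_KEYWORDS = ("heartbeat ok", "test alert")
GROUP_KEYWORDS = ("disk", "timeout", "login")


def incident_key(alert):
    """Return a (source, keyword) grouping key, or None to suppress the alert."""
    text = alert.message.lower()
    if any(word in text for word in SUPPRESS_KEYWORDS):
        return None  # non-actionable: no incident, no response
    for word in GROUP_KEYWORDS:
        if word in text:
            return (alert.source, word)
    # Nothing matched: leave it for a human dispatcher to classify.
    return (alert.source, "unclassified")


def group_alerts(alerts):
    """Consolidate related alerts into candidate incidents keyed by (source, keyword)."""
    incidents = defaultdict(list)
    for alert in alerts:
        key = incident_key(alert)
        if key is not None:
            incidents[key].append(alert)
    return dict(incidents)
```

Alerts that land in the "unclassified" bucket are the ones most likely to need the manual dispatcher review described above.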

Most priority schemes follow the ITIL incident prioritization guidelines, or something similar. One of the key elements of the ITIL guidelines is that incident priority is based on two closely related factors: impact and urgency. In this post, we’ll take a closer look at both of those factors, and how they interact.

Determine Incident Impact

Impact is generally based on the scope of an incident’s effects — how many departments, users, or key services are affected. It can be relatively easy to automate at least some elements of the impact determination process. A large number of near-simultaneous reports that a specific service is unavailable, for example, may be a good indication of a high-impact incident, while a report of a problem from a single user, unaccompanied by any similar reports, is more likely to indicate a low-impact incident. For many IT departments, the guidelines for determining incident impact might look something like this:

  • High impact:
    • A critical system is down.
    • One or more departments are affected.
    • A significant number of staff members are not able to perform their functions.
    • The incident affects a large number of customers.
    • The incident has the potential for major financial loss or damage to the organization’s reputation.
    • Other criteria, depending on the function of the organization and the affected systems, could include such things as threat to public safety, potential loss of life, or major property damage.
  • Moderate impact:
    • Some staff members or customers are affected.
    • None of the services lost are critical.
    • Financial loss and damage to the organization’s reputation are possible, but limited in scope.
    • There is no threat to life, public safety, or physical property.
  • Low impact:
    • Only a small number of users are affected.
    • No critical services are involved, and there is little or no potential for financial loss or loss of reputation.
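The impact guidelines above could be partly automated along the following lines. This is a minimal sketch: the `ImpactFacts` fields and the numeric thresholds are assumptions made for illustration, since the guidelines themselves only say "a large number", "some", and "a small number". It checks the high-impact criteria first, so an incident lands in the highest category for which it qualifies.

```python
from dataclasses import dataclass
from enum import IntEnum


class Impact(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


@dataclass
class ImpactFacts:
    critical_system_down: bool = False
    departments_affected: int = 0        # whole departments unable to work
    users_affected: int = 0              # staff members or customers
    major_loss_possible: bool = False    # major financial or reputational damage
    limited_loss_possible: bool = False  # loss possible but limited in scope
    safety_threat: bool = False          # threat to life, safety, or property


# Hypothetical thresholds; each organization would pick its own.
MANY_USERS = 100
SOME_USERS = 10


def classify_impact(facts):
    """Return the highest impact category whose criteria the incident meets."""
    if (
        facts.critical_system_down
        or facts.departments_affected >= 1
        or facts.users_affected >= MANY_USERS
        or facts.major_loss_possible
        or facts.safety_threat
    ):
        return Impact.HIGH
    if facts.users_affected >= SOME_USERS or facts.limited_loss_possible:
        return Impact.MODERATE
    return Impact.LOW
```

For example, a single user report with no other signals would come out as `Impact.LOW`, while the same report combined with `critical_system_down=True` would come out as `Impact.HIGH`.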

Determine Incident Urgency

It is not always easy to draw a strict distinction between incident impact and incident urgency, but for the most part, urgency in this context can be defined as how quickly a problem will begin to have an effect on the system. The failure of a payroll system may have a high impact, for example, but if it occurs at the beginning of a pay cycle, it is likely to be less urgent than the loss of a customer relations database that is in heavy use every day.

  • High urgency:
    • A service which is critical for day-to-day operations is unavailable.
    • The incident’s sphere of impact is expanding rapidly, or quick action may make it possible to limit its scope.
    • Time-sensitive work or customer actions are affected.
    • The incident affects high-status individuals or organizations (e.g., upper management or major clients).
  • Low urgency:
    • Affected services are optional and used infrequently.
    • The effects of the incident appear to be stable.
    • Important or time-sensitive work is not affected.

Note that for both impact and urgency, meeting a single criterion (rather than all or a majority of criteria) for a category is generally sufficient. Incidents should be placed in the highest category for which they qualify.
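The "any single criterion is enough" rule can be expressed compactly as a set intersection. The sketch below applies it to the urgency criteria listed above; the criterion names are hypothetical labels invented for this example, not identifiers from any particular tool.

```python
from enum import IntEnum


class Urgency(IntEnum):
    LOW = 1
    HIGH = 2


# Hypothetical labels for the high-urgency criteria listed in this post.
HIGH_URGENCY_CRITERIA = frozenset({
    "critical_daily_service_down",
    "scope_expanding",
    "time_sensitive_work_affected",
    "high_status_party_affected",
})


def classify_urgency(observed):
    """Return HIGH if any one observed criterion is a high-urgency criterion."""
    if HIGH_URGENCY_CRITERIA & set(observed):
        return Urgency.HIGH
    return Urgency.LOW
```

Because a single match is sufficient, an incident observed as `{"scope_expanding"}` is classified `Urgency.HIGH` even if every other indicator looks calm.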

Priority = Impact + Urgency

At this point, it should be pretty easy to see that priority is a direct function of both impact and urgency. Whatever alert management and incident dispatching processes you put in place, if they route incidents according to these priority criteria, you’ll be able to hush a considerable amount of alert noise, and low-impact, low-urgency events will naturally sink to the bottom of your priority list. This will allow your incident response teams to concentrate on the high-impact, high-urgency incidents that genuinely require the most attention, with very little distraction or alert fatigue.
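One common way to combine the two factors is a lookup matrix. The sketch below is an assumption-laden example: the specific priority numbers (1 is most pressing) are illustrative choices, not values prescribed by ITIL or by this post, and incidents are represented as plain `(name, impact, urgency)` tuples for simplicity.

```python
# Hypothetical priority matrix: (impact, urgency) -> priority, 1 = handle first.
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "low"): 2,
    ("moderate", "high"): 2,
    ("moderate", "low"): 3,
    ("low", "high"): 3,
    ("low", "low"): 4,
}


def priority(impact, urgency):
    """Look up the priority for an impact/urgency pair."""
    return PRIORITY_MATRIX[(impact, urgency)]


def triage(incidents):
    """Sort (name, impact, urgency) tuples so the highest priority comes first.

    sorted() is stable, so incidents with equal priority keep their input order.
    """
    return sorted(incidents, key=lambda inc: priority(inc[1], inc[2]))
```

With a matrix like this, low-impact, low-urgency events end up at the end of the sorted list without any extra filtering logic.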

To learn more about how to aggregate, classify, and suppress events to manage what matters, check out PagerDuty’s alert triage and event rules engine. You can also easily classify incidents based on your organization’s custom definitions of priority.

And that mountain of alerts? By focusing on what’s actionable and urgent — especially with the help of a solution like PagerDuty — you may just find that it isn’t there anymore!

The post Determining Incident Priority appeared first on PagerDuty.


PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and Github.
