
Determining Incident Priority


Alerts. It’s so easy for them to pile up. One moment, you’re looking at a handful of alerts. A few hours — or maybe even minutes — later, you’re looking at a mountain. How do you manage them and keep your responders from being completely overwhelmed?

This is a hugely important question. If your alert management system is flooded with noise and your response teams are in a permanent state of alert fatigue, you may as well not have an IT alert management system at all. Excessive noise and alert fatigue severely reduce the effectiveness of any alert management system.

Apply Filtering: Alerts to Incidents

In many ways, the key to streamlining your alert management system lies in a rapid and accurate method for consolidating related alerts into incidents and determining incident priority. Sorting incidents by priority automatically filters out most noise and gives you a reasonable approximation of what needs immediate attention and what can wait. Also keep in mind that not every alert needs an incident or a response; suppressing non-actionable alerts further cuts down the noise and lets you focus on what matters.

You can probably automate at least part of the sorting process (for example, by source and keywords), although some of it, perhaps a considerable amount, will likely require monitoring and intervention by response team members acting as dispatchers. Whatever method you use, the basic criteria remain the same.
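As a rough illustration of automated sorting by source and keywords, here is a minimal Python sketch. The `Alert` class, the keyword lists, and the function names are all hypothetical and exist only for this example; they are not part of PagerDuty or any other product.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical keyword lists; tune these to your own monitoring vocabulary.
SUPPRESS_KEYWORDS = ("heartbeat ok", "test alert")
GROUP_KEYWORDS = ("timeout", "disk", "unavailable")


@dataclass
class Alert:
    source: str   # the service or host that raised the alert
    message: str  # free-text alert description


def incident_key(alert):
    """Return a (source, keyword) grouping key, or None to suppress the alert."""
    text = alert.message.lower()
    if any(word in text for word in SUPPRESS_KEYWORDS):
        return None  # non-actionable: no incident, no response
    for word in GROUP_KEYWORDS:
        if word in text:
            return (alert.source, word)
    return (alert.source, "other")  # unmatched: left for a human dispatcher


def group_alerts(alerts):
    """Consolidate related alerts into incidents keyed by source and keyword."""
    incidents = defaultdict(list)
    for alert in alerts:
        key = incident_key(alert)
        if key is not None:
            incidents[key].append(alert)
    return dict(incidents)
```

Anything that lands in an "other" group is a natural candidate for manual review by whoever is filling the dispatcher role.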

Most priority schemes follow the ITIL incident prioritization guidelines, or something similar. One of the key elements of the ITIL guidelines is that incident priority is based on two closely related factors: impact and urgency. In this post, we’ll take a closer look at both of those factors, and how they interact.

Determine Incident Impact

Impact is generally based on the scope of an incident’s effects — how many departments, users, or key services are affected. It can be relatively easy to automate at least some elements of the impact determination process. A large number of near-simultaneous reports that a specific service is unavailable, for example, may be a good indication of a high-impact incident, while a report of a problem from a single user, unaccompanied by any similar reports, is more likely to indicate a low-impact incident. For many IT departments, the guidelines for determining incident impact might look something like this:

  • High impact:
    • A critical system is down.
    • One or more departments are affected.
    • A significant number of staff members are not able to perform their functions.
    • The incident affects a large number of customers.
    • The incident has the potential for major financial loss or damage to the organization’s reputation.
    • Other criteria, depending on the function of the organization and the affected systems, could include such things as threat to public safety, potential loss of life, or major property damage.
  • Moderate impact:
    • Some staff members or customers are affected.
    • None of the services lost are critical.
    • Financial loss and damage to the organization’s reputation are possible, but limited in scope.
    • There is no threat to life, public safety, or physical property.
  • Low impact:
    • Only a small number of users are affected.
    • No critical services are involved, and there is little or no potential for financial loss or loss of reputation.
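Guidelines like these can be partly encoded as rules. The sketch below is one possible encoding, not a standard: the field names and the numeric cut-offs (`MANY_USERS`, `SOME_USERS`) are assumptions made for illustration, and every organization would need to choose its own.

```python
from dataclasses import dataclass
from enum import IntEnum


class Impact(IntEnum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


# Hypothetical cut-offs; the guidelines only say "large", "some" and "small".
MANY_USERS = 100
SOME_USERS = 10


@dataclass
class IncidentFacts:
    critical_system_down: bool = False
    departments_affected: int = 0
    users_affected: int = 0              # staff members or customers
    major_loss_possible: bool = False    # major financial or reputational loss
    limited_loss_possible: bool = False  # loss possible but limited in scope
    safety_threat: bool = False          # threat to life, safety, or property


def classify_impact(facts):
    """Return the highest impact level whose criteria the incident meets."""
    if (
        facts.critical_system_down
        or facts.departments_affected >= 1
        or facts.users_affected >= MANY_USERS
        or facts.major_loss_possible
        or facts.safety_threat
    ):
        return Impact.HIGH
    if facts.users_affected >= SOME_USERS or facts.limited_loss_possible:
        return Impact.MODERATE
    return Impact.LOW
```

For example, `classify_impact(IncidentFacts(users_affected=25))` falls into the moderate band under these assumed thresholds.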

Determine Incident Urgency

It is not always easy to draw a strict distinction between incident impact and incident urgency, but for the most part, urgency in this context can be defined as how quickly a problem will begin to have an effect on the system. The failure of a payroll system may have a high impact, for example, but if it occurs at the beginning of a pay cycle, it is likely to be less urgent than the loss of a customer relations database which is put to heavy use on a daily basis.

  • High urgency:
    • A service which is critical for day-to-day operations is unavailable.
    • The incident’s sphere of impact is expanding rapidly, or quick action may make it possible to limit its scope.
    • Time-sensitive work or customer actions are affected.
    • The incident affects high-status individuals or organizations (e.g., upper management or major clients).
  • Low urgency:
    • Affected services are optional and used infrequently.
    • The effects of the incident appear to be stable.
    • Important or time-sensitive work is not affected.

Note that for both impact and urgency, meeting a single criterion (rather than all or a majority of criteria) for a category is generally sufficient. Incidents should be placed in the highest category for which they qualify.
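The "single criterion, highest category" rule can be sketched in a few lines of Python. The criterion labels below are hypothetical shorthand for the urgency bullets above; the function itself is generic and could be given impact categories just as well.

```python
# Categories are listed from highest to lowest. Labels are illustrative only.
URGENCY_CATEGORIES = [
    ("high", {
        "critical_daily_service_down",
        "scope_expanding",
        "time_sensitive_work_affected",
        "high_status_party_affected",
    }),
]


def highest_category(observed, categories, default):
    """Return the first (highest) category with at least one matching criterion.

    observed   -- iterable of criterion labels that apply to the incident
    categories -- list of (name, criteria_set) pairs, ordered highest first
    default    -- category to use when nothing matches
    """
    observed = set(observed)
    for name, criteria in categories:
        if observed & criteria:  # a single matching criterion is sufficient
            return name
    return default


def classify_urgency(observed):
    return highest_category(observed, URGENCY_CATEGORIES, default="low")
```

An incident tagged only with `"scope_expanding"` is classified as high urgency here, even though none of the other high-urgency criteria apply.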

Priority = Impact + Urgency

At this point, it should be easy to see that priority is a direct function of both impact and urgency. Whatever alert management and incident dispatching processes you put in place, if they route incidents based on these priority criteria, you’ll be able to hush a considerable amount of alert noise, and low-impact, low-urgency events will naturally sink to the bottom of your priority list. This allows your incident response teams to concentrate on the high-impact, high-urgency incidents that genuinely require the most attention, with very little distraction or alert fatigue.
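One common way to combine the two factors is a lookup table, often called a priority matrix. The mapping below is an illustrative assumption, not an official ITIL matrix, and the P1 to P4 numbering is likewise hypothetical; adjust both to your organization's own definitions.

```python
# Illustrative priority matrix: (impact, urgency) -> priority, 1 = most urgent.
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "low"): 2,
    ("moderate", "high"): 2,
    ("moderate", "low"): 3,
    ("low", "high"): 3,
    ("low", "low"): 4,
}


def priority(impact, urgency):
    """Look up the priority for an impact/urgency pair."""
    return PRIORITY_MATRIX[(impact, urgency)]


def sort_by_priority(incidents):
    """Sort (name, impact, urgency) tuples so the highest priority comes first."""
    return sorted(incidents, key=lambda inc: priority(inc[1], inc[2]))


if __name__ == "__main__":
    queue = [
        ("printer offline", "low", "low"),
        ("checkout failing", "high", "high"),
        ("payroll system down", "high", "low"),
    ]
    for name, impact, urgency in sort_by_priority(queue):
        print(f"P{priority(impact, urgency)}: {name}")
```

With a table like this, low-impact, low-urgency incidents end up at the bottom of the queue without any manual re-sorting.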

To learn more about how to aggregate, classify, and suppress events to manage what matters, check out PagerDuty’s alert triage and event rules engine. You can also easily classify incidents based on your organization’s custom definitions of priority.

And that mountain of alerts? By focusing on what’s actionable and urgent — especially with the help of a solution like PagerDuty — you may just find that it isn’t there anymore!

The post Determining Incident Priority appeared first on PagerDuty.


