Optimizing Your Alert Management Process

In a simpler world, all alerts would be created equal and your infrastructure would either be completely working or completely broken, with no middle ground.

In reality, however, the world is not that simple. Especially not today, when infrastructure is more diverse and complex than ever.

Coping with that complexity requires a different approach to monitoring and alert management. You need to do much more than treat incident management as a process of responding to alerts in the order they come in or assuming that every alert requires action.

This post explains why a flexible, nuanced approach to alert management is vital, and how to implement it.

Modern Infrastructure is Complex

To understand why a flexible alert management process is essential, let’s examine the factors that make modern infrastructure complex. Consider the following points:

Infrastructure is heavily layered and interdependent

Back in the day, you had a bunch of bare-metal servers and workstations, and that was about it. Today, in the age of software-defined everything, your infrastructure is a complex stack of physical and virtual machines, software-defined networks, thin clients, intermittently connected sensors, and so on — all intertwined and layered atop one another. As a result, an alert that appears to originate from one source (like a Dockerized application) could actually be rooted in a problem on a different part of the infrastructure (like the storage array to which your Docker host server is connected).

Some problems are more serious than others

This is pretty obvious to anyone who has any experience in incident management. Still, it’s worth emphasizing just how broad the range of problems today can be and how hard it is to interpret the severity of an alert quickly. For instance, an alert telling you that a storage server has stopped responding may seem very serious at first glance. But if the server is part of a scaled-out storage cluster with automatic failover, the downtime is not actually high priority. No data is likely to be lost, and business continuity will not be interrupted, if the team does not respond to the issue right away. Additionally, some alerts serve as warnings but are not immediately actionable. That information should be kept for pattern and anomaly detection at the infrastructure-wide level, but to prevent alert fatigue it should be suppressed rather than trigger a human response.

Real-time response is crucial

In today’s always-on world, users will find out about service failures in real time. The alert management process therefore needs to happen in real time, too. The fact that users tend to report problems in public places like social media channels before contacting your company makes real-time resolution even more imperative. Be proactive instead of reactive; you don’t want to wait until your customers have generated a stream of angry Tweets before you get around to responding to a serious alert.

Application performance matters

It’s no longer enough to simply make sure your applications are running. You also need them to be performing at their best, since users have little patience for poor performance. If your website is slow, for example, customers will go elsewhere after as few as ten seconds of waiting. What this means from an alerting perspective is that being notified when an application has stopped responding completely is not sufficient. While uptime monitoring is crucial, you also must receive alerts about poor performance. Moreover, you need to be able to differentiate them from no-response alerts.

Making Nuanced Alerting Work in Practice

Now that you know the challenges of modern alert management, how can you solve them?

The answer is to make your alert management process more flexible and agile. Use strategies such as the following:

Make high-priority alerts highly visible

In order to react to the most serious alerts quickly, you need to be able to see them easily. That’s hard to do if high- and low-priority alerts are mixed together on your monitoring dashboards. It becomes much easier if you dedicate a dashboard to alerts that your monitoring software marks as high-priority.

Suppress unhelpful alerts

Eliminating unhelpful alerts will also do much to declutter your dashboards and increase visibility. You can do that by suppressing alerts for low-priority events, like the creation of a new user account. The advantage of suppressing such alerts, rather than disabling them completely, is that the alerts still happen and can be consulted if necessary, but they don’t distract admins when there are more pressing alerts to handle.
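As a rough illustration of the difference between suppressing and disabling, here is a minimal Python sketch. The `AlertLog` class, the `kind` field, and the `LOW_PRIORITY_KINDS` set are hypothetical names invented for this example, not part of any particular monitoring product. Every alert is still recorded, but only the ones that are not suppressed reach a human.

```python
# Hypothetical example: alert kinds treated as low priority (from the text's example).
LOW_PRIORITY_KINDS = {"account_created"}


class AlertLog:
    """Keeps every alert on record but only surfaces the unsuppressed ones."""

    def __init__(self):
        self.history = []   # every alert, suppressed or not, for later review
        self.notified = []  # only the alerts that should reach a human

    def record(self, alert: dict) -> None:
        suppressed = alert["kind"] in LOW_PRIORITY_KINDS
        # The alert still "happens": it is stored along with its suppression flag.
        self.history.append({**alert, "suppressed": suppressed})
        if not suppressed:
            self.notified.append(alert)
```

Because suppressed alerts remain in `history`, they can still feed pattern detection or be consulted during an investigation.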

Nuanced alert reporting and suppression

It’s important to keep in mind that suppression does not have to be an either/or proposition. You can suppress some alerts of a certain type under certain circumstances, but choose not to suppress them under others.

For example, maybe you want to suppress alerts related to account creation if they occur during business hours, when staff would normally be creating accounts, but make those alerts visible if they occur outside of that window. Or maybe you want to suppress alerts about a server reboot unless the reboots happen more than three times within a fixed period.
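The two rules above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the business hours (9:00 to 17:00 on weekdays), the one-hour reboot window, and the names `SuppressionRules` and `should_suppress` are all hypothetical choices made for the example.

```python
from collections import deque
from datetime import datetime, timedelta

BUSINESS_START, BUSINESS_END = 9, 17  # assumed office hours (local time)
REBOOT_LIMIT = 3                      # from the text: more than three reboots
REBOOT_WINDOW = timedelta(hours=1)    # assumed length of the "fixed period"


def in_business_hours(ts: datetime) -> bool:
    """True on weekdays between the assumed start and end hours."""
    return ts.weekday() < 5 and BUSINESS_START <= ts.hour < BUSINESS_END


class SuppressionRules:
    def __init__(self):
        self._reboots = {}  # host name -> deque of recent reboot timestamps

    def should_suppress(self, kind: str, host: str, ts: datetime) -> bool:
        if kind == "account_created":
            # Suppress only while staff would normally be creating accounts.
            return in_business_hours(ts)
        if kind == "server_reboot":
            recent = self._reboots.setdefault(host, deque())
            recent.append(ts)
            # Drop reboots that have fallen out of the sliding window.
            while recent and ts - recent[0] > REBOOT_WINDOW:
                recent.popleft()
            # Suppress until the host exceeds the reboot limit in the window.
            return len(recent) <= REBOOT_LIMIT
        return False  # anything else is never suppressed by these rules
```

A sliding window is used here for the reboot rule; a fixed calendar window (for example, per hour on the clock) would be an equally reasonable reading of the text.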

It is also crucial to de-duplicate wherever possible, as well as create associations between related alerts to prevent redundant resolution and communication efforts.

To minimize alert noise without missing important events, you should triage alerts in a more refined way by implementing mechanisms such as suppression, grouping related alerts, and customizing notification thresholds.
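To make de-duplication and grouping a little more concrete, here is a minimal Python sketch. It assumes each alert is a dictionary with `host`, `check`, and `message` fields, and that you maintain a simple `depends_on` map from a host to the component it relies on (such as a Docker host and its storage array). Those field names and the map are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    key: tuple  # (host, check) identifies "the same" alert
    count: int = 0
    messages: list = field(default_factory=list)


def deduplicate(alerts):
    """Collapse alerts that share a (host, check) key into one incident each."""
    incidents = {}
    for alert in alerts:
        key = (alert["host"], alert["check"])
        incident = incidents.setdefault(key, Incident(key))
        incident.count += 1
        incident.messages.append(alert["message"])
    return list(incidents.values())


def root_of(host, depends_on):
    """Follow the dependency chain upward; stop if a cycle is detected."""
    seen = set()
    while host in depends_on and host not in seen:
        seen.add(host)
        host = depends_on[host]
    return host


def group_related(incidents, depends_on):
    """Group incidents under the component they ultimately depend on."""
    groups = {}
    for incident in incidents:
        root = root_of(incident.key[0], depends_on)
        groups.setdefault(root, []).append(incident)
    return groups
```

With `depends_on = {"docker-host-1": "storage-array-1"}`, alerts from the Docker host and from the storage array end up in the same group, so one responder can work on them as a single problem.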

Send different alerts to different people

An alert management process that directs all alerts to all members of the team is inefficient. Different types of alerts should be directed to different team members according to their respective skillsets and availability. Because availability changes over time, it is even more important to be able to dispatch alerts flexibly. A subject matter expert who is available and ready to manage an incident one hour may go off duty the next.

By sending alerts to the right people from the start, you eliminate much of the manual work that would otherwise be necessary to triage issues and assign them to staff.
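Routing by skillset and availability might look something like the following Python sketch. The `Responder` class, the shift times, and the `on-call-lead` fallback are hypothetical; a real setup would normally pull schedules from your on-call tooling rather than hard-code them.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Responder:
    name: str
    skills: set        # alert types this person can handle
    shift_start: time
    shift_end: time

    def on_duty(self, now: datetime) -> bool:
        t = now.time()
        if self.shift_start <= self.shift_end:
            return self.shift_start <= t < self.shift_end
        # Shift wraps past midnight (for example, 22:00 to 06:00).
        return t >= self.shift_start or t < self.shift_end


def route(alert_type: str, now: datetime, team: list,
          fallback: str = "on-call-lead") -> list:
    """Return the names of responders who have the skill and are on duty now."""
    matches = [r.name for r in team if alert_type in r.skills and r.on_duty(now)]
    # Never drop an alert: hand it to the assumed fallback if nobody matches.
    return matches or [fallback]
```

Because availability is evaluated at the moment the alert arrives, the same database alert reaches a different person at 10:00 than at 17:00 without any manual reassignment.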

Report on more than just downtime

As noted above, successful alert management today means detecting slow performance, not just total failures. For this reason, it’s important to configure monitoring software to generate alerts when systems are approaching the limits of their capacity (when network load exceeds 80 percent, for instance, or when demand for an application reaches an unusually high level without yet exceeding capacity).

Of course, you do not have to give these types of alerts the same priority as alerts that signal complete failure. The latter incidents would be more important to know about and handle immediately. But you also don’t want to wait until something breaks completely before responding to it. Instead, optimize your alert process so that you can deal with performance problems long before they turn into downtime.
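One simple way to keep capacity warnings separate from outright failures is to give them different severities. The Python sketch below assumes a health check that reports whether a system is responding plus its load as a fraction between 0 and 1; the `Severity` levels and the `classify` function are illustrative names, and the 80 percent threshold comes from the example above.

```python
from enum import IntEnum


class Severity(IntEnum):
    OK = 0
    WARNING = 1   # degraded or nearing capacity: act soon
    CRITICAL = 2  # not responding at all: act now


WARN_LOAD = 0.80  # from the text: warn when load exceeds 80 percent


def classify(responding: bool, load: float) -> Severity:
    """Map a health-check reading to a severity level."""
    if not responding:
        return Severity.CRITICAL
    if load > WARN_LOAD:
        return Severity.WARNING
    return Severity.OK
```

Since `Severity` is an `IntEnum`, readings can be sorted by severity so that complete failures appear ahead of capacity warnings.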

In the DevOps age, infrastructure is agile. Your alert management process needs to be, too. The days of assuming that all alerts are of equal importance, or that every alert needs to be reported and reviewed, are over. Monitoring the complex, ever-changing infrastructure of today without becoming overwhelmed requires an optimized approach to alerting, one that streamlines an IT organization’s ability to identify and interpret alerts according to their level of importance.


The post Optimizing Your Alert Management Process appeared first on PagerDuty.

More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and Github.
