
Optimizing Your Alert Management Process

In a simpler world, all alerts would be created equal and your infrastructure would either be completely working or completely broken, with no middle ground.

In reality, however, the world is not that simple. Especially not today, when infrastructure is more diverse and complex than ever.

Coping with that complexity requires a different approach to monitoring and alert management. You need to do much more than treat incident management as a process of responding to alerts in the order they come in or assuming that every alert requires action.

This post explains why a flexible, nuanced approach to alert management is vital, and how to implement it.

Modern Infrastructure is Complex

To understand why a flexible alert management process is essential, let’s examine the factors that make modern infrastructure complex. Consider the following points:

Infrastructure is heavily layered and interdependent

Back in the day, you had a bunch of bare-metal servers and workstations, and that was about it. Today, in the age of software-defined everything, your infrastructure is a complex stack of physical and virtual machines, software-defined networks, thin clients, intermittently connected sensors, and so on — all intertwined and layered atop one another. As a result, an alert that appears to originate from one source (like a Dockerized application) could actually be rooted in a problem on a different part of the infrastructure (like the storage array to which your Docker host server is connected).

Some problems are more serious than others

This is obvious to anyone with experience in incident management. Still, it’s worth emphasizing just how broad the range of problems can be today and how hard it is to judge the severity of an alert quickly. For instance, an alert telling you that a storage server has stopped responding may seem very serious at first glance. But if the server is part of a scaled-out storage cluster with automatic failover, the outage is not actually high priority: no data is likely to be lost, and business continuity will not be interrupted, if the team does not respond right away. Additionally, some alerts serve as warnings but are not immediately actionable. To prevent alert fatigue, such alerts should be suppressed rather than trigger a human response, while the information they carry is retained for pattern and anomaly detection at the infrastructure-wide level.
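The storage example can be expressed as a small rule. The following is a minimal sketch, not a feature of any particular monitoring product; the function name, its parameters, and the "low"/"high" labels are all assumptions made for illustration.

```python
def storage_alert_priority(failover_enabled, healthy_nodes, min_nodes):
    """Rate a 'storage server not responding' alert using cluster context.

    All parameters are hypothetical inputs that a monitoring system would
    have to supply: whether automatic failover is configured, how many
    nodes are still healthy, and how many the cluster needs to keep serving.
    """
    if failover_enabled and healthy_nodes >= min_nodes:
        # The cluster can absorb the failure, so nobody needs to be paged.
        return "low"
    # No failover, or too few nodes left: treat as urgent.
    return "high"


# One node of a five-node cluster fails; three nodes are enough to serve.
print(storage_alert_priority(True, healthy_nodes=4, min_nodes=3))
# The same failure on a standalone server with no failover.
print(storage_alert_priority(False, healthy_nodes=0, min_nodes=1))
```

The point of the sketch is that the same raw event yields different priorities once the surrounding infrastructure is taken into account.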

Real-time response is crucial

In today’s always-on world, users find out about service failures in real time. The alert management process therefore needs to happen in real time, too. The fact that users tend to report problems in public places like social media channels before contacting your company makes real-time resolution even more imperative. Be proactive instead of reactive; you don’t want to wait until your customers have generated a stream of angry Tweets before you get around to responding to a serious alert.

Application performance matters

It’s no longer enough to simply make sure your applications are running. You also need them to be performing at their best, since users have little patience for poor performance. If your website is slow, for example, customers will go elsewhere after as few as ten seconds of waiting. What this means from an alerting perspective is that being notified when an application has stopped responding completely is not sufficient. While uptime monitoring is crucial, you also must receive alerts about poor performance. Moreover, you need to be able to differentiate them from no-response alerts.

Making Nuanced Alerting Work in Practice

Now that you know the challenges of modern alert management, how can you solve them?

The answer is to make your alert management process more flexible and more agile. Use strategies such as the following:

Make high-priority alerts highly visible

In order to react to the most serious alerts quickly, you need to be able to see them easily. That’s hard to do if high- and low-priority alerts are mixed together on your monitoring dashboards. It becomes much easier if you dedicate a dashboard to alerts that your monitoring software marks as high-priority.

Suppress unhelpful alerts

Eliminating unhelpful alerts will also do much to declutter your dashboards and increase visibility. You can do that by suppressing alerts for low-priority events, like the creation of a new user account. The advantage of suppressing such alerts, rather than disabling them completely, is that the alerts still happen and can be consulted if necessary, but they don’t distract admins when there are more pressing alerts to handle.
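To illustrate the difference between suppressing and disabling, here is a minimal sketch in which suppressed alerts are still recorded but never reach the dashboard. The `Alert` and `AlertStore` classes and the `user_account_created` event type are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    event_type: str
    message: str


@dataclass
class AlertStore:
    # Event types considered low priority (assumed for illustration).
    suppressed_types: set = field(
        default_factory=lambda: {"user_account_created"}
    )
    visible: list = field(default_factory=list)   # shown on dashboards
    archived: list = field(default_factory=list)  # kept for later review

    def record(self, alert):
        """Store the alert; return True if admins should see it."""
        if alert.event_type in self.suppressed_types:
            # Suppressed, not disabled: the alert is still retained.
            self.archived.append(alert)
            return False
        self.visible.append(alert)
        return True


store = AlertStore()
store.record(Alert("user_account_created", "New account: jdoe"))
store.record(Alert("disk_failure", "Disk failed on db-01"))
```

After these two calls, only the disk failure appears in `store.visible`, while the account-creation alert remains available in `store.archived`.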

Nuanced alert reporting and suppression

It’s important to keep in mind that suppression does not have to be an either/or proposition. You can suppress some alerts of a certain type under certain circumstances, but choose not to suppress them under others.

For example, maybe you want to suppress alerts related to account creation if they occur during business hours, when staff would normally be creating accounts, but make those alerts visible if they occur outside of that window. Or maybe you want to suppress alerts about a server reboot unless the reboots happen more than three times within a fixed period.
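Both rules can be sketched in a few lines. The business hours (09:00 to 17:00, Monday to Friday), the one-hour period, and every name below are assumptions for illustration. The reboot rule uses a sliding window, which is one possible reading of "a fixed period".

```python
from collections import deque
from datetime import datetime, timedelta

BUSINESS_START, BUSINESS_END = 9, 17  # assumed business hours
REBOOT_LIMIT = 3                      # suppress up to three reboots
REBOOT_WINDOW = timedelta(hours=1)    # assumed length of the period


def suppress_account_creation(ts):
    """Suppress account-creation alerts raised during business hours."""
    is_weekday = ts.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and BUSINESS_START <= ts.hour < BUSINESS_END


class RebootTracker:
    """Suppress reboot alerts unless a host reboots too often."""

    def __init__(self):
        self._recent = {}  # host -> deque of recent reboot timestamps

    def should_suppress(self, host, ts):
        times = self._recent.setdefault(host, deque())
        times.append(ts)
        # Forget reboots that fall outside the sliding window.
        while ts - times[0] > REBOOT_WINDOW:
            times.popleft()
        return len(times) <= REBOOT_LIMIT


tracker = RebootTracker()
start = datetime(2024, 1, 2, 3, 0)  # a Tuesday, 03:00
decisions = [
    tracker.should_suppress("web-01", start + timedelta(minutes=10 * i))
    for i in range(4)
]
```

Here the first three reboots of `web-01` are suppressed and the fourth, arriving within the same hour, is surfaced.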

It is also crucial to de-duplicate wherever possible, as well as create associations between related alerts to prevent redundant resolution and communication efforts.
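As a rough sketch of both ideas, the code below collapses repeated alerts into one entry with a count, then groups alerts under the upstream component they depend on. The alert fields, host names, and the `depends_on` map are hypothetical; a real system would derive dependencies from its own inventory or topology data.

```python
from collections import defaultdict


def deduplicate(alerts):
    """Collapse alerts that share (host, check) into one entry with a count."""
    merged = {}
    for alert in alerts:
        key = (alert["host"], alert["check"])
        if key in merged:
            merged[key]["count"] += 1
        else:
            merged[key] = {**alert, "count": 1}
    return list(merged.values())


def group_by_dependency(alerts, depends_on):
    """Group alerts under the upstream component each host depends on."""
    groups = defaultdict(list)
    for alert in alerts:
        # Hosts without a known dependency form their own group.
        root = depends_on.get(alert["host"], alert["host"])
        groups[root].append(alert)
    return dict(groups)


raw = [
    {"host": "docker-host-1", "check": "app_latency"},
    {"host": "docker-host-1", "check": "app_latency"},
    {"host": "storage-array-1", "check": "io_latency"},
]
unique = deduplicate(raw)
related = group_by_dependency(unique, {"docker-host-1": "storage-array-1"})
```

In this example the two identical application alerts become one, and both remaining alerts end up in a single group keyed by the storage array, so one responder can handle them as one incident.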

To minimize alert noise without missing important events, you should triage alerts in a more refined way by implementing mechanisms such as suppression, grouping related alerts, and customizing notification thresholds.

Send different alerts to different people

An alert management process that directs all alerts to all members of the team is inefficient. Different types of alerts should be directed to different team members according to their respective skillsets and availability. Because availability changes constantly, it is even more important to be able to dispatch alerts flexibly. A subject matter expert who is available and ready to manage an incident one hour may go off duty the next.

By sending alerts to the right people from the start, you eliminate much of the manual work that would otherwise be necessary to triage issues and assign them to staff.
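A simple version of this routing might look like the sketch below, which matches an alert type against each responder's skills and current shift. The `Responder` class, the team members, and the shift times are invented for illustration and do not reflect any specific scheduling tool.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Responder:
    name: str
    skills: frozenset
    shift_start: time
    shift_end: time

    def on_duty(self, now):
        if self.shift_start <= self.shift_end:
            return self.shift_start <= now < self.shift_end
        # The shift wraps past midnight (for example 22:00 to 06:00).
        return now >= self.shift_start or now < self.shift_end


def route(alert_type, now, team):
    """Return the names of on-duty responders skilled in this alert type."""
    return [
        r.name for r in team
        if alert_type in r.skills and r.on_duty(now)
    ]


team = [
    Responder("Ana", frozenset({"database", "storage"}), time(8), time(16)),
    Responder("Ben", frozenset({"database", "network"}), time(22), time(6)),
]
day_shift = route("database", time(10, 0), team)
night_shift = route("database", time(23, 0), team)
```

The same database alert goes to Ana during the day and to Ben at night. An empty result, such as a database alert at 18:00 with this team, signals a coverage gap that would need an escalation fallback.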

Report on more than just downtime

As noted above, successful alert management today means detecting slow performance, not just total failures. For this reason, it’s important to configure monitoring software to generate alerts when systems are approaching the limits of their capacity (when network load exceeds 80 percent, for instance, or when demand for an application approaches an unusual threshold but has not yet surpassed it).

Of course, you do not have to give these types of alerts the same priority as alerts that signal complete failure. The latter incidents would be more important to know about and handle immediately. But you also don’t want to wait until something breaks completely before responding to it. Instead, optimize your alert process so that you can deal with performance problems long before they turn into downtime.
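One way to encode this distinction is to assign a lower priority to capacity warnings than to outright failures. The sketch below uses the 80 percent network-load figure mentioned above; the `Priority` levels and the `classify` function are assumptions for illustration.

```python
from enum import IntEnum
from typing import Optional


class Priority(IntEnum):
    OK = 0
    WARNING = 1   # degraded or near capacity: handle soon
    CRITICAL = 2  # complete failure: handle immediately


WARNING_LOAD = 0.80  # warn when load exceeds 80 percent


def classify(responding: bool, load: Optional[float]) -> Priority:
    """Rank a health check result; load is a fraction between 0 and 1."""
    if not responding:
        return Priority.CRITICAL
    if load is not None and load > WARNING_LOAD:
        return Priority.WARNING
    return Priority.OK


checks = [
    classify(responding=True, load=0.85),
    classify(responding=False, load=None),
    classify(responding=True, load=0.40),
]
# Handle the most urgent results first.
ordered = sorted(checks, reverse=True)
```

Because `Priority` is an `IntEnum`, results sort naturally, so complete failures rise to the top while capacity warnings still surface before they turn into downtime.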

In the DevOps age, infrastructure is agile. Your alert management process needs to be, too. The days of assuming that all alerts are of equal importance, or that every alert needs to be reported and reviewed, are over. Monitoring the complex, ever-changing infrastructure of today without becoming overwhelmed requires an optimized approach to alerting, one that streamlines an IT organization’s ability to identify and interpret alerts according to their level of importance.


The post Optimizing Your Alert Management Process appeared first on PagerDuty.


More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and Github.
