Optimizing Your Alert Management Process

In a simpler world, all alerts would be created equal and your infrastructure would either be completely working or completely broken, with no middle ground.

In reality, however, the world is not that simple. Especially not today, when infrastructure is more diverse and complex than ever.

Coping with that complexity requires a different approach to monitoring and alert management. You need to do much more than respond to alerts in the order they come in, or assume that every alert requires action.

This post explains why a flexible, nuanced approach to alert management is vital, and how to implement it.

Modern Infrastructure is Complex

To understand why a flexible alert management process is essential, let’s examine the factors that make modern infrastructure complex. Consider the following points:

Infrastructure is heavily layered and interdependent

Back in the day, you had a bunch of bare-metal servers and workstations, and that was about it. Today, in the age of software-defined everything, your infrastructure is a complex stack of physical and virtual machines, software-defined networks, thin clients, intermittently connected sensors, and so on — all intertwined and layered atop one another. As a result, an alert that appears to originate from one source (like a Dockerized application) could actually be rooted in a problem on a different part of the infrastructure (like the storage array to which your Docker host server is connected).

Some problems are more serious than others

This is pretty obvious to anyone with experience in incident management. Still, it’s worth emphasizing just how broad the range of problems can be today, and how hard it is to interpret the severity of an alert quickly. For instance, an alert telling you that a storage server has stopped responding may seem very serious at first glance. But if the server is part of a scaled-out storage cluster with automatic failover, the downtime is not actually high priority: no data is likely to be lost, and business continuity will not be interrupted if the team does not respond right away. Additionally, some alerts serve as warnings but are not immediately actionable. That information should be kept for infrastructure-wide pattern and anomaly detection, but it should be suppressed rather than triggering a human response, to prevent alert fatigue.

Real-time response is crucial

In today’s always-on world, users find out about service failures in real time. The alert management process therefore needs to happen in real time, too. The fact that users tend to report problems in public places like social media channels before contacting your company makes real-time resolution even more imperative. Be proactive instead of reactive; you don’t want to wait until your customers have generated a stream of angry tweets before you get around to responding to a serious alert.

Application performance matters

It’s no longer enough to simply make sure your applications are running. You also need them to perform at their best, since users have little patience for poor performance. If your website is slow, for example, customers will go elsewhere after as few as ten seconds of waiting. From an alerting perspective, this means that being notified when an application has stopped responding completely is not sufficient. While uptime monitoring is crucial, you must also receive alerts about poor performance, and you need to be able to differentiate them from no-response alerts.

Making Nuanced Alerting Work in Practice

Now that you know the challenges of modern alert management, how can you solve them?

The answer is to make your alert management process flexible and agile. Use strategies such as the following:

Make high-priority alerts highly visible

In order to react to the most serious alerts quickly, you need to be able to see them easily. That’s hard to do if high- and low-priority alerts are mixed together on your monitoring dashboards. It becomes much easier if you dedicate a dashboard to alerts that your monitoring software marks as high-priority.

Suppress unhelpful alerts

Eliminating unhelpful alerts will also do much to declutter your dashboards and increase visibility. You can do that by suppressing alerts for low-priority events, like the creation of a new user account. The advantage of suppressing such alerts, rather than disabling them completely, is that the alerts still happen and can be consulted if necessary, but they don’t distract admins when there are more pressing alerts to handle.
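As a sketch of that distinction, the following Python records every alert but only notifies on unsuppressed ones. The alert fields and the event names are hypothetical, not tied to any particular monitoring product:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    event: str
    severity: str  # e.g., "low" or "high"; labels are illustrative

class AlertLog:
    """Suppression, not deletion: every alert is recorded for later
    review, but only unsuppressed alerts generate a notification."""

    SUPPRESSED_EVENTS = {"user_account_created"}  # hypothetical event name

    def __init__(self):
        self.history = []        # all alerts, suppressed or not
        self.notifications = []  # only alerts that should page someone

    def ingest(self, alert):
        self.history.append(alert)
        if alert.event not in self.SUPPRESSED_EVENTS:
            self.notifications.append(alert)

log = AlertLog()
log.ingest(Alert("user_account_created", "low"))  # suppressed, but kept
log.ingest(Alert("disk_full", "high"))            # triggers a notification
```

The key design choice is that `history` and `notifications` diverge: suppressed alerts remain available for audits and pattern analysis without ever paging anyone.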

Nuanced alert reporting and suppression

It’s important to keep in mind that suppression does not have to be an either/or proposition. You can suppress some alerts of a certain type under certain circumstances, but choose not to suppress them under others.

For example, maybe you want to suppress alerts related to account creation if they occur during business hours, when staff would normally be creating accounts, but make those alerts visible if they occur outside of that window. Or maybe you want to suppress alerts about a server reboot unless the reboots happen more than three times within a fixed period.
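Both rules above can be expressed as small predicates. This sketch assumes a 9:00–17:00 business-hours window and a one-hour reboot window; the exact values would come from your own policy:

```python
from collections import deque
from datetime import datetime, timedelta

def suppress_account_creation(alert_time):
    """Suppress account-creation alerts during business hours (assumed
    here to be 09:00-17:00 local), when staff normally create accounts."""
    return 9 <= alert_time.hour < 17

class RebootWatcher:
    """Surface reboot alerts only when more than `limit` reboots occur
    within `window`; isolated reboots stay suppressed."""

    def __init__(self, limit=3, window=timedelta(hours=1)):
        self.limit = limit
        self.window = window
        self.reboots = deque()

    def should_alert(self, when):
        self.reboots.append(when)
        # Drop reboots that fell out of the sliding window.
        while self.reboots and when - self.reboots[0] > self.window:
            self.reboots.popleft()
        return len(self.reboots) > self.limit
```

With the default settings, the first three reboots in an hour are suppressed and the fourth raises an alert, matching the "more than three times within a fixed period" rule.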

It is also crucial to de-duplicate wherever possible, as well as create associations between related alerts to prevent redundant resolution and communication efforts.

To minimize alert noise without missing important events, you should triage alerts in a more refined way by implementing mechanisms such as suppression, grouping of related alerts, and customized notification thresholds.
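A minimal de-duplication pass might fingerprint each alert by source and event type. The two fields and the fold-into-a-count behavior are illustrative assumptions; real grouping logic would be richer:

```python
def fingerprint(alert):
    """Hypothetical fingerprint: alerts with the same source and event
    type are treated as duplicates of one incident."""
    return (alert["source"], alert["event"])

def deduplicate(alerts):
    """Fold duplicate alerts into a single incident, keeping a count so
    the repetition is still visible to responders."""
    incidents = {}
    for alert in alerts:
        key = fingerprint(alert)
        if key in incidents:
            incidents[key]["count"] += 1  # duplicate: bump the count
        else:
            incidents[key] = {**alert, "count": 1}
    return list(incidents.values())

incoming = [
    {"source": "web-1", "event": "high_latency"},
    {"source": "web-1", "event": "high_latency"},    # duplicate
    {"source": "db-1",  "event": "replication_lag"},
]
incidents = deduplicate(incoming)
```

Three raw alerts collapse into two incidents, so responders resolve and communicate about each problem once rather than per-alert.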

Send different alerts to different people

An alert management process that directs all alerts to all members of the team is inefficient. Different types of alerts should be directed to different team members according to their respective skill sets and availability. The fact that availability changes constantly makes it even more important to be able to dispatch alerts flexibly: a subject matter expert who is available and ready to manage an incident one hour may go off duty the next.

By sending alerts to the right people from the start, you eliminate much of the manual work that would otherwise be necessary to triage issues and assign them to staff.
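Skill- and availability-based routing can be sketched as a simple lookup. The roster, skill tags, and fall-back-to-everyone policy here are all hypothetical:

```python
# Hypothetical on-call roster: responders, their skills, and whether
# they are currently on duty.
ROSTER = [
    {"name": "alice", "skills": {"database", "storage"}, "on_duty": True},
    {"name": "bob",   "skills": {"network"},             "on_duty": True},
    {"name": "carol", "skills": {"database"},            "on_duty": False},
]

def route(alert_category, roster=ROSTER):
    """Send the alert to on-duty responders whose skills match; fall back
    to everyone on duty if no specialist is available."""
    matches = [r["name"] for r in roster
               if r["on_duty"] and alert_category in r["skills"]]
    if matches:
        return matches
    return [r["name"] for r in roster if r["on_duty"]]
```

Because `on_duty` is data rather than code, the routing adapts as people rotate on and off call without changing the dispatch logic itself.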

Report on more than just downtime

As noted above, successful alert management today means detecting slow performance, not just total failures. For this reason, it’s important to configure monitoring software to generate alerts when systems are approaching the limits of their capacity (when network load exceeds 80 percent, for instance, or demand for an application reaches an unusual threshold but has not yet surpassed it).

Of course, you do not have to give these types of alerts the same priority as alerts that signal complete failure. The latter incidents would be more important to know about and handle immediately. But you also don’t want to wait until something breaks completely before responding to it. Instead, optimize your alert process so that you can deal with performance problems long before they turn into downtime.
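One way to express that tiering is a threshold function that distinguishes "approaching capacity" from "failed". The 0.8 cutoff echoes the 80 percent network-load example above; both thresholds are illustrative, not prescriptive:

```python
def classify(utilization):
    """Map a utilization reading (0.0-1.0) to an alert priority.
    Thresholds are illustrative and would be tuned per system."""
    if utilization >= 1.0:
        return "critical"  # complete saturation or failure: page now
    if utilization >= 0.8:
        return "warning"   # approaching capacity: act before downtime
    return None            # healthy: no alert
```

Warnings get a lower priority than critical alerts, but they arrive early enough that the team can intervene before a capacity problem becomes an outage.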

In the DevOps age, infrastructure is agile. Your alert management process needs to be, too. The days of assuming that all alerts are of equal importance, or that every alert needs to be reported and reviewed, are over. Monitoring the complex, ever-changing infrastructure of today without becoming overwhelmed requires an optimized approach to alerting, one that streamlines an IT organization’s ability to identify and interpret alerts according to their level of importance.


The post Optimizing Your Alert Management Process appeared first on PagerDuty.
