After the Disaster: How to Learn from Historical Incident Management Data

Your high school history teacher no doubt delivered some variation on George Santayana’s famous remark: “Those who cannot remember the past are condemned to repeat it.”

I’m pretty sure Santayana wasn’t thinking about incident management when he wrote that. But his wisdom still applies — and it’s worth heeding if you’re responsible for incident management.

True, the main purpose of incident management is to identify and resolve issues that affect your infrastructure, but your incident management operations shouldn’t stop there. Instead of just reacting to customer tickets, take advantage of the rich volumes of data your alerting systems generate to detect and prevent issues proactively, and to gain insights that will make your infrastructure more resilient going forward.

In this post, I’ll outline some strategies for working with historical incident management data, including how to collect and analyze data, and what to look for when working with this information.

Save and standardize your data

The first step in analyzing historical incident management data is finding a standardized way to collect and parse the information. This can be challenging, since the amount and format of historical log data vary widely between monitoring systems.

Some monitoring systems don’t provide much at all in the way of logged data that you can examine after the fact. For example, Pingdom is a great tool for real-time monitoring, but since it was designed to tell you what’s happening now, not what happened yesterday, it doesn’t provide much historical data on its own.

Other monitoring systems keep data for limited periods of time or store it in formats that are hard to work with. For instance, to analyze Snort data, you may need to sift through packet dumps. Unless Wireshark is your favorite way to spend a Friday evening, that’s a lot of work.

Moreover, if you have lots of monitoring systems in place, they probably dump data to a number of scattered locations. Some tools write logs to /var/log on local machines, where they’re hard to find and may be deleted by maintenance scripts. Others keep logs in the cloud for varying lengths of time — not ideal if you want to analyze all of your historical data at once.

For these reasons, in order to make the most of your incident management data, you should make sure to do two things:

  1. Send alerts and logs to a central collection point where they can be stored as long as you need them (rather than as long as the original monitoring system or local storage will support them).
  2. Convert data at your collection point to a standard format — and extract actionable insights and takeaways that can be reinvested into the infrastructure (with a process like incident postmortems).

Tools like Logstash, Splunk and Papertrail can be helpful here. They assist in collecting data from siloed locations and directing it to a central storage point.
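To make step 2 concrete, here’s a minimal sketch in Python of what a normalization layer at your collection point might look like. Everything here is an assumption for illustration: the common schema fields (source, severity, occurred_at, summary) and the shapes of the incoming payloads are hypothetical, not any vendor’s actual formats.

```python
from datetime import datetime, timezone

# Hypothetical normalizer: maps alerts from two different sources into one
# common schema so they can be stored and queried together. Field names and
# input payload shapes are illustrative only.

def normalize_uptime_check(raw: dict) -> dict:
    """Normalize a made-up uptime-checker webhook payload."""
    return {
        "source": "uptime-checker",
        "severity": "critical" if raw.get("state") == "down" else "info",
        "occurred_at": datetime.fromtimestamp(
            raw["timestamp"], tz=timezone.utc
        ).isoformat(),
        "summary": f"{raw.get('check_name', 'unknown check')} is {raw.get('state')}",
    }

def normalize_syslog_line(line: str) -> dict:
    """Normalize a simple '<ISO timestamp> <LEVEL> <message>' log line."""
    ts, level, message = line.split(" ", 2)
    return {
        "source": "syslog",
        "severity": level.lower(),
        "occurred_at": ts,
        "summary": message.strip(),
    }

if __name__ == "__main__":
    alerts = [
        normalize_uptime_check(
            {"state": "down", "timestamp": 1478563200, "check_name": "api-health"}
        ),
        normalize_syslog_line("2016-11-08T00:00:00Z ERROR disk usage above 90% on web-3"),
    ]
    for alert in alerts:
        print(alert)  # In practice, write each record to your central store.
```

The payoff of a layer like this is that every downstream query, report, or visualization only has to understand one schema, no matter how many monitoring tools feed it.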

PagerDuty takes things a step further by letting you import data from these and other sources, converting it to a standardized format, and centralizing and cross-correlating it with visualizations that surface patterns and trends and can be leveraged to identify root causes and more.

View and analyze your data

Saving your data is only half the battle. The other challenge is how to view and analyze it.

In most cases, the simplest way to view your data is over a web-based interface. Ideally, it’ll feature a sophisticated search that you can use to find specific events from your logs, monitor the current status of incidents, and so on. That’s why being able to filter and search across your entire infrastructure with normalized fields is so helpful.

While the web interface may be good for finding small-scale trends or tracing the history of a specific type of incident, to get the bigger picture you need, well, pictures. Tables and lists of alerts don’t help you understand system-wide trends. Visualizations based on your incident management data, like the kind PagerDuty includes in reports, help you to interpret information on a large scale.

Last but not least — especially if you’re into analyzing data programmatically — are APIs that let you export your log data as needed. The PagerDuty API makes it easy to collect and export log data in whatever format you need (and the Events API v2 also automatically normalizes all that data into a common format).
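As a rough illustration of programmatic export, here’s a short Python sketch that pages through historical incidents. The endpoint, headers, and pagination fields follow PagerDuty’s REST API v2 as publicly documented, but treat the details as assumptions and check the current API reference; the token and date range are placeholders.

```python
import requests  # third-party: pip install requests

API_TOKEN = "YOUR_API_TOKEN_HERE"  # placeholder, not a real credential

def fetch_incidents(since: str, until: str) -> list:
    """Page through incidents in a date range via PagerDuty's REST API v2."""
    headers = {
        "Authorization": f"Token token={API_TOKEN}",
        "Accept": "application/vnd.pagerduty+json;version=2",
    }
    incidents, offset = [], 0
    while True:
        resp = requests.get(
            "https://api.pagerduty.com/incidents",
            headers=headers,
            params={"since": since, "until": until, "limit": 100, "offset": offset},
        )
        resp.raise_for_status()
        page = resp.json()
        incidents.extend(page["incidents"])
        if not page.get("more"):  # "more" flags whether another page exists
            break
        offset += 100
    return incidents

if __name__ == "__main__":
    for incident in fetch_incidents("2016-10-01T00:00:00Z", "2016-11-01T00:00:00Z"):
        print(incident["created_at"], incident.get("title"))
```

From here, the records can be dumped to CSV, a database, or whatever format your analysis tooling prefers.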

What to look for

Once you’ve collected and analyzed your data, what should you be looking for? Your exact needs will vary according to the type of infrastructure you’re monitoring, of course, but some general points of information to heed include:

  • The frequency at which incidents are occurring. If this number changes over time, you’ll want to know why.
  • Mean time to acknowledge (MTTA) and mean time to resolve (MTTR) incidents. By keeping track of these numbers, you’ll know how effectively your team is handling its incident management responsibilities (a minimal sketch of computing them appears after this list).
  • Who on your team is doing the most to handle alerts? Knowing this not only allows you to reward members for their hard work; it also tells you whether your alerts are being distributed properly and going to the right people. For example, if one admin is receiving more than their fair share of alerts, you should tweak things so they don’t become overwhelmed — that leads to alert fatigue, and no one wants that.
  • Which monitoring systems are generating the most alerts? If you amalgamate the alerts from your various monitoring systems into a single logging location, as I suggested above, you can also identify which systems are giving you the most information. You’ll be able to see if a system is underperforming or generating too much noise, and tune your alerting thresholds as needed.
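As promised above, here’s a minimal sketch of computing MTTA and MTTR from normalized incident records. The field names (created_at, acknowledged_at, resolved_at) are assumptions about your own normalized schema rather than any specific tool’s output.

```python
from datetime import datetime
from statistics import mean

def _parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

def mtta_mttr_minutes(incidents: list) -> tuple:
    """Return (MTTA, MTTR) in minutes, skipping records missing a timestamp."""
    ttas = [
        (_parse(i["acknowledged_at"]) - _parse(i["created_at"])).total_seconds() / 60
        for i in incidents if i.get("acknowledged_at")
    ]
    ttrs = [
        (_parse(i["resolved_at"]) - _parse(i["created_at"])).total_seconds() / 60
        for i in incidents if i.get("resolved_at")
    ]
    return (mean(ttas) if ttas else None, mean(ttrs) if ttrs else None)

if __name__ == "__main__":
    sample = [
        {"created_at": "2016-11-01T10:00:00Z",
         "acknowledged_at": "2016-11-01T10:04:00Z",
         "resolved_at": "2016-11-01T10:45:00Z"},
        {"created_at": "2016-11-02T02:00:00Z",
         "acknowledged_at": "2016-11-02T02:10:00Z",
         "resolved_at": "2016-11-02T03:00:00Z"},
    ]
    mtta, mttr = mtta_mttr_minutes(sample)
    print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")  # MTTA: 7.0 min, MTTR: 52.5 min
```

Run over a rolling window (say, month over month), these two numbers give you exactly the trend lines described in the bullet above.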

If you follow these tips, you won’t be left repeating history by facing the same types of incidents over and over again. Instead, you’ll be able to identify the big-picture trends, which will help you to find ways to make your infrastructure more effective overall.

And that’s how incident management can really pay off. Remember another oft-quoted maxim — “An ounce of prevention is worth a pound of cure.” Incident response is the cure, but creating a continuous feedback loop with historical incident management data is the best practice that enables prevention.
