Keep Critical Apps and Infrastructure Up and Running

“Incident lifecycle management? If we manage to stay alive from one incident to the next, it’s a good day. On a bad day, it’s all panic mode.”

Unfortunately, that’s the reality of incident lifecycle management for far too many software and IT companies — but it doesn’t have to be that way. The truth is that genuine, proactive incident lifecycle management can keep incident-response teams from falling into chronic survival or panic mode.

Incident lifecycle management is a framework for categorizing, responding to, resolving, and documenting incidents so that they can be handled effectively with minimal loss of services and with well-organized follow-up. An end-to-end incident resolution framework is crucial for maintaining critical services.

Customer-Centered Incident Management

Most modern incident management systems are based, to one degree or another, on the ITIL model, first developed in the 1980s by the British government’s Central Computing and Telecommunications Agency. The ITIL model is centered on maintaining services to clients and customers, as opposed to maintaining key systems strictly according to technical specifications. This makes it an ideal model for incident response in outward-facing applications, where maintaining user services is of high importance. The most important elements of the ITIL model to keep in mind when setting up an incident lifecycle management framework are:

Initial Response

This is the phase during which incoming alerts are logged, categorized, and routed to the appropriate teams. In many respects, this is the most important part of the incident management lifecycle, because it is when you detect issues and filter out noise (non-actionable alerts), set priorities, and determine where each alert should be routed.

Failure to adequately manage this part of the process can result in important alerts being missed, handled at too-low priority, or routed to the wrong responders, as well as unbalanced workloads for response teams.
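The initial-response phase described above can be sketched in a few lines. This is a minimal illustration, not any particular product’s API; the team names, severity scale, and routing table are assumptions made up for the example:

```python
# Sketch of initial response: log, filter noise, prioritize, and route
# incoming alerts. Severity levels, team names, and the ROUTING table
# are hypothetical placeholders.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Alert:
    source: str       # originating service, e.g. "checkout-service"
    message: str
    severity: int     # 1 = critical ... 5 = informational (assumed scale)
    actionable: bool  # noise flag, set upstream by monitoring rules

ROUTING = {  # hypothetical service -> responder mapping
    "checkout-service": "payments-oncall",
    "auth-service": "identity-oncall",
}

def triage(alert: Alert) -> Optional[str]:
    """Return the team an alert should be routed to, or None if it is noise."""
    if not alert.actionable:
        return None                      # filter out non-actionable noise
    if alert.severity == 1:
        return "major-incident-bridge"   # major incidents bypass Level 1
    return ROUTING.get(alert.source, "level1-default")
```

Keeping the routing rules explicit and centralized like this is what prevents the failure modes above: nothing is silently dropped, and unknown sources still land on a default Level 1 queue rather than nowhere.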

Level 1 Response

After an alert has been categorized, it is sent to a Level 1 response team. Level 1 teams are the first responders; their job is to resolve the incident to the customer’s satisfaction, typically within a specified time frame. The Level 1 team will investigate the incident, figure out what the basic problem is, and apply known or recommended remediations wherever possible. 

Level 1 support also monitors the status of the incident, particularly with regard to escalation. Another key responsibility of Level 1 support is to maintain communication with the affected customer or client and provide status updates at intervals set by contract or by organizational policy. This makes it possible to maintain a consistent channel of communication and support, even if the incident has been passed on to higher-level support.

Level 2 Response

If an incident is beyond Level 1 support’s capacity for diagnosis and quick resolution, it is typically passed on to a Level 2 support team, which will generally be able to bring more resources and experience into play. 

Level 2 teams are also able to call in specialized and third-party support (from manufacturers, vendors, etc.). The basic goal of Level 2 support remains the same as Level 1—to restore service to the customer or client as quickly as possible.
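The handoff between levels is often driven by time as well as by difficulty. A hedged sketch of time-based escalation, assuming a 30-minute Level 1 resolution window (the window and level names are illustrative, not a standard):

```python
# Sketch of time-based escalation: if Level 1 has not resolved an
# incident within its target window, ownership passes to Level 2.
# LEVEL1_WINDOW is an assumed policy value, not an ITIL-mandated figure.
from datetime import datetime, timedelta

LEVEL1_WINDOW = timedelta(minutes=30)  # hypothetical resolution target

def current_level(opened_at: datetime, resolved: bool, now: datetime) -> str:
    """Which support level owns the incident at this moment."""
    if resolved:
        return "closed"
    if now - opened_at > LEVEL1_WINDOW:
        return "level2"   # past Level 1's window: escalate
    return "level1"
```

In practice the escalation clock would be paused or reset by policy (for example, while waiting on a customer), but the core idea is the same: escalation is an explicit, auditable state change, not an ad hoc handoff.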

Post-resolution Reporting and Review

The formal ITIL model breaks this down into two processes: Closure and Evaluation, and Incident Management Reporting. For many organizations, particularly smaller ones, it may be more convenient to combine them into a single process.

The key elements of any post-resolution wrap-up are to verify, record, and evaluate the resolution (or lack of one), and to fully report the details of the incident (typically with a post-mortem report). Incident post-mortem reports should be entered into an information base that is available to response teams and managers. That information base should be sufficiently indexed and searchable to serve as an easily accessible source of information for responding to (and, ideally, preventing) future incidents.
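The “indexed and searchable” requirement can be illustrated with a toy knowledge base. The schema and the naive keyword index below are assumptions made for the sketch; a real deployment would use a proper search engine or wiki:

```python
# Minimal sketch of an indexed, searchable post-mortem store, so past
# incident reports can inform future response. Incident IDs and the
# whitespace-tokenized keyword index are illustrative simplifications.
from collections import defaultdict
from typing import Dict, List, Set

class PostmortemStore:
    def __init__(self) -> None:
        self.reports: Dict[str, str] = {}              # incident_id -> report
        self.index: Dict[str, Set[str]] = defaultdict(set)  # word -> ids

    def add(self, incident_id: str, report: str) -> None:
        """File a post-mortem and index every word in it."""
        self.reports[incident_id] = report
        for word in report.lower().split():
            self.index[word].add(incident_id)

    def search(self, keyword: str) -> List[str]:
        """Return the IDs of all incidents whose reports mention keyword."""
        return sorted(self.index.get(keyword.lower(), set()))
```

Even this crude index captures the point: a responder facing a new outage can find every prior incident that mentioned the same symptom in one lookup, instead of relying on team memory.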

Other Key Issues

In addition to the elements listed above, the ITIL model includes two other factors which come into play in any realistic incident lifecycle management system:

Major Incident Handling

Major incidents are typically those which present an immediate, serious threat to the operation or security of basic infrastructure or key services. The objective is still to get the system up and running as quickly as possible, but the priority and initial level of response may be much higher. A major incident may go directly to Level 2, to a specialized support team, or even to third-party support (for example, if an important component of the hardware infrastructure breaks down).

Each organization may have its own standards for what constitutes a major incident, but for most organizations, it is important to recognize that major incidents form their own category, with a significantly higher level of priority and response.

Workarounds

Because one of the top priorities of incident management in the ITIL model is to maintain or restore customer service as quickly as possible, the initial resolution may involve workarounds — a rollback, for instance. This is true at all levels. The logic is simple: If you restore customer service now, you’ve solved the immediate problem and the IT or development team can then take as much time as necessary to resolve the underlying issues.

It is important to log and identify all workarounds, both in the incident report system, and when scheduling IT and development updates, because every workaround results in technical debt, the cost of which generally becomes higher the longer it goes unpaid. This means that workarounds resulting from incident response should be replaced with solutions conforming to system design standards as soon as it is practical to do so. In many respects, an incident isn’t fully resolved until any workarounds have been replaced by more permanent solutions.
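Logging workarounds so they are eventually replaced is, in effect, keeping a technical-debt register. A minimal sketch, with field names invented for the example:

```python
# Sketch of a workaround log treated as a technical-debt register:
# every workaround is recorded, and it stays "outstanding" until a
# permanent, standards-conforming fix is filled in. Field names are
# illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Workaround:
    incident_id: str
    description: str                   # e.g. "rolled back to prior release"
    permanent_fix: Optional[str] = None  # set once the debt is paid

class DebtRegister:
    def __init__(self) -> None:
        self.items: List[Workaround] = []

    def log(self, w: Workaround) -> None:
        self.items.append(w)

    def outstanding(self) -> List[Workaround]:
        """Workarounds still awaiting a permanent solution."""
        return [w for w in self.items if w.permanent_fix is None]
```

Reviewing the outstanding list when scheduling IT and development updates is what closes the loop: the incident is only fully resolved once that list is empty for it.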

There really is no need for your incident response team to operate in survival mode from day to day. In a world where it has never been more expensive to be unprepared for customer-impacting issues, survival mode only adds chaos and anxiety to the equation.

With an incident lifecycle management framework tailored to the needs of your organization, you can keep critical applications and infrastructure running with minimal service interruption and minimal stress. Implementing a best-practice incident lifecycle is the key to reliability, and reliability itself is an indispensable service that will help define your long-term success.

The post Keep Critical Apps and Infrastructure Up and Running appeared first on PagerDuty.
