Keep Critical Apps and Infrastructure Up and Running

“Incident lifecycle management? If we manage to stay alive from one incident to the next, it’s a good day. On a bad day, it’s all panic mode.”

Unfortunately, that’s the reality of incident lifecycle management for far too many software and IT companies — but it doesn’t have to be that way. The truth is that genuine, proactive incident lifecycle management can keep incident-response teams from falling into chronic survival or panic mode.

Incident lifecycle management is a framework for categorizing, responding to, resolving, and documenting incidents so that they can be handled effectively with minimal loss of services and with well-organized follow-up. An end-to-end incident resolution framework is crucial for maintaining critical services.

Customer-Centered Incident Management

Most modern incident management systems are based to one degree or another on the ITIL model, first developed in the 1980s by the British government’s Central Computing and Telecommunications Agency. The ITIL model is centered around maintaining services to clients and customers, as opposed to maintaining key systems strictly according to technical specifications. This makes it an ideal model for incident response in outward-facing applications, where maintenance of user services is of high importance. The most important elements of the ITIL model to keep in mind when setting up an incident lifecycle management framework are:

Initial Response

This is the phase during which incoming alerts are logged, categorized, and routed to the appropriate teams. In many respects, this is the most important part of the incident management lifecycle, because it is when you detect issues and filter out noise (non-actionable alerts), set priorities, and determine where each alert should be routed.

Failure to adequately manage this part of the process can result in important alerts being missed, handled at too-low priority, or routed to the wrong responders, as well as unbalanced workloads for response teams.
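
As a rough illustration of the log, filter, prioritize, and route steps described above, here is a minimal Python sketch. The routing table, the severity names, the fallback queue, and the `triage` function are all hypothetical and stand in for whatever your own monitoring and ticketing tools provide.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical routing table: alert source -> responder team.
ROUTES = {"database": "db-oncall", "network": "netops", "application": "app-oncall"}

# Hypothetical severity -> priority mapping (1 is most urgent).
# Severities missing from this table are treated as non-actionable noise.
PRIORITY = {"critical": 1, "error": 2, "warning": 3}


@dataclass
class Alert:
    source: str    # e.g. "database", "network"
    severity: str  # e.g. "critical", "error", "warning", "info"
    summary: str


@dataclass
class Ticket:
    alert: Alert
    priority: int
    team: str


def triage(alert: Alert, log: list) -> Optional[Ticket]:
    """Log the alert, drop noise, then assign a priority and a team."""
    log.append(alert)                   # every alert is logged, even noise
    if alert.severity not in PRIORITY:  # non-actionable: filter it out
        return None
    team = ROUTES.get(alert.source, "level1-general")  # assumed fallback queue
    return Ticket(alert, PRIORITY[alert.severity], team)
```

Keeping the routing and priority rules in plain tables like these makes them easy to review when alerts turn out to be misrouted or under-prioritized.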

Level 1 Response

After an alert has been categorized, it is sent to a Level 1 response team. Level 1 teams are the first responders; their job is to resolve the incident to the customer’s satisfaction, typically within a specified time frame. The Level 1 team will investigate the incident, figure out what the basic problem is, and apply known or recommended remediations wherever possible. 

Level 1 support also monitors the status of the incident, particularly with regard to escalation. Another key responsibility of Level 1 support is to stay in contact with the affected customer or client and provide status updates at intervals that may be set by contract or by organizational policy. This keeps a consistent channel of communication and support open, even if the incident has been passed on to higher-level support.
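
One way to picture these two Level 1 duties (watching for escalation and keeping the customer informed) is a small check that runs against each open incident. The sketch below is only an assumption about how such a check might look; the function name, the action labels, and the idea of a single resolution target are illustrative and do not come from ITIL itself.

```python
from datetime import datetime, timedelta


def due_actions(opened_at: datetime, last_update_at: datetime, now: datetime,
                update_interval: timedelta, resolution_target: timedelta,
                escalated: bool = False) -> list:
    """Return the actions Level 1 owes on an open incident right now."""
    actions = []
    # Escalate once the agreed resolution window has passed (only once).
    if not escalated and now - opened_at >= resolution_target:
        actions.append("escalate")
    # Status updates stay with Level 1 even after the incident is escalated.
    if now - last_update_at >= update_interval:
        actions.append("send_status_update")
    return actions
```

For example, with a 30-minute update interval and a two-hour resolution target, an incident that was opened 45 minutes ago and has had no update since would return only `["send_status_update"]`.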

Level 2 Response

If an incident is beyond Level 1 support’s capacity for diagnosis and quick resolution, it is typically passed on to a Level 2 support team, which will generally be able to bring more resources and experience into play. 

Level 2 teams are also able to call in specialized and third-party support (from manufacturers, vendors, etc.). The basic goal of Level 2 support remains the same as Level 1—to restore service to the customer or client as quickly as possible.

Post-Resolution Reporting and Review

The formal ITIL model breaks this down into two processes: Closure and Evaluation, and Incident Management Reporting. For many organizations, particularly smaller ones, it may be more convenient to combine them into a single process.

The key elements of any post-resolution wrap-up are to verify, record, and evaluate the resolution (or lack of one), and to fully report the details of the incident (typically with a post-mortem report). Incident post-mortem reports should be entered into an information base that is available to response teams and managers, and which is sufficiently indexed and searchable to serve as an easily accessible source of information for responding to (and hopefully preventing) future incidents.
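
To make "indexed and searchable" a little more concrete, here is a minimal sketch of a keyword index over post-mortem reports. The `PostmortemIndex` class and the incident IDs are hypothetical; a real information base would more likely sit on a wiki, a ticketing system, or a search engine.

```python
import re
from collections import defaultdict


class PostmortemIndex:
    """Tiny in-memory keyword index over post-mortem report text."""

    def __init__(self):
        self._reports = {}              # incident id -> full report text
        self._index = defaultdict(set)  # word -> ids of reports containing it

    def add(self, incident_id: str, text: str) -> None:
        self._reports[incident_id] = text
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self._index[word].add(incident_id)

    def search(self, query: str) -> list:
        """Return ids of reports containing every word in the query."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return []
        hits = set.intersection(*(self._index.get(w, set()) for w in words))
        return sorted(hits)
```

A responder facing a new database failover problem could then call `search("database failover")` and start from what earlier incidents already learned.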

Other Key Issues

In addition to the elements listed above, the ITIL model includes two other factors which come into play in any realistic incident lifecycle management system:

Major Incident Handling

Major incidents are typically those that present an immediate, serious threat to the operation or security of basic infrastructure or key services. The objective is still to get the system up and running as quickly as possible, but the priority and initial level of response may be much higher. A major incident may go directly to Level 2, to a specialized support team, or even to third-party support (for example, if an important component of the hardware infrastructure breaks down).

Each organization may have its own standards for what constitutes a major incident, but for most organizations, it is important to recognize that major incidents form their own category, with a significantly higher level of priority and response.
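
Since every organization sets its own bar for a major incident, the criteria in the following sketch are purely illustrative assumptions, as are the field names and responder labels. It only shows the general shape of the idea: decide whether an incident is major, then pick a higher starting level of response when it is.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    affects_key_service: bool       # basic infrastructure or a key service is at risk
    security_threat: bool           # operation or data security is at risk
    hardware_failure: bool = False  # a hardware component has broken down


def is_major(incident: Incident) -> bool:
    """Hypothetical criteria; substitute your organization's own standard."""
    return incident.affects_key_service or incident.security_threat


def initial_responder(incident: Incident) -> str:
    """Pick the starting level of response for a newly categorized incident."""
    if not is_major(incident):
        return "level1"               # normal path: first responders
    if incident.hardware_failure:
        return "third_party_vendor"   # e.g. the hardware manufacturer
    return "level2"                   # major incidents skip Level 1
```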


Workarounds

Because one of the top priorities of incident management in the ITIL model is to maintain or restore customer service as quickly as possible, the initial resolution may involve a workaround, such as a rollback. This is true at all levels. The logic is simple: if you restore customer service now, you’ve solved the immediate problem, and the IT or development team can then take as much time as necessary to resolve the underlying issues.

It is important to log and identify all workarounds, both in the incident report system and when scheduling IT and development updates, because every workaround creates technical debt, and the cost of that debt generally grows the longer it goes unpaid. Workarounds resulting from incident response should therefore be replaced with solutions that conform to system design standards as soon as it is practical to do so. In many respects, an incident isn’t fully resolved until any workarounds have been replaced by more permanent solutions.
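
A simple way to keep that debt visible is to record each workaround alongside its incident and regularly list the ones that have not yet been replaced, oldest first. The record fields and the `open_workarounds` helper below are hypothetical and sketch just one possible shape for such a log.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Workaround:
    incident_id: str
    description: str
    applied_on: date
    replaced_on: Optional[date] = None  # set once a permanent fix ships


def open_workarounds(workarounds: list, today: date) -> list:
    """Return (incident_id, age_in_days) for unreplaced workarounds, oldest first."""
    pending = [w for w in workarounds if w.replaced_on is None]
    pending.sort(key=lambda w: w.applied_on)
    return [(w.incident_id, (today - w.applied_on).days) for w in pending]
```

Reviewing this list when scheduling IT and development work keeps long-lived workarounds from quietly becoming permanent.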

There really is no need for your incident response team to operate in survival mode from day to day. In a world where it’s never been more expensive to be unprepared for customer-impacting issues, running in survival mode only adds chaos and anxiety.

With an incident lifecycle management framework tailored to the needs of your organization, you can keep critical applications and infrastructure running with minimal service interruption and minimal stress. Implementing a best-practice incident lifecycle is the key to reliability, and reliability itself is an indispensable service that will help define your long-term success.

The post Keep Critical Apps and Infrastructure Up and Running appeared first on PagerDuty.

More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and GitHub.
