5 Incident Management Tools You Need During a Firefight

It’s critical to have the right tools in place before a firefight happens. A lack of proper tooling makes it significantly more difficult to recognize, organize, fight, and resolve a major outage. This is especially true when teams are busy fighting rather than communicating with internal and external stakeholders. If best practices have been established ahead of time, a difficult incident can be handled much more smoothly.

The following is not an exhaustive list of areas to plan for before an outage, but addressing them will greatly improve your organization’s ability to coordinate and prepare for any issue.

1. Internal Communications

Internal communication commonly takes place over email. This is problematic for a number of reasons. Email is essentially a one-to-one medium. It is closed by default, meaning it is readable only by the sender and the recipients, and it is inherently bulky and difficult to parse when quick status information is needed. Persistent collaboration environments like Slack and HipChat provide an externally hosted place to share information. Both platforms also offer public, opt-in, topical channels that can be used for incident updates. At the critical level, status updates (or a message that the issue is already known and being worked on) can be provided to key staff, such as support and leadership, in near real time.
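As a minimal sketch of the kind of status update described above, the snippet below formats a one-line message and wraps it in the `{"text": ...}` JSON shape that Slack incoming webhooks accept. The function name, the incident ID format, and the status labels are assumptions made for illustration; the snippet only builds the payload and does not send it.

```python
import json
from datetime import datetime, timezone


def build_status_update(incident_id, status, summary, now=None):
    """Return a JSON string suitable for POSTing to a chat incoming webhook.

    incident_id and status are hypothetical labels chosen for this sketch.
    """
    # Default to the current UTC time so every update carries a timestamp.
    now = now or datetime.now(timezone.utc)
    text = f"[{incident_id}] {status.upper()} at {now:%H:%M} UTC: {summary}"
    return json.dumps({"text": text})
```

Posting the resulting string to a dedicated incident channel gives support and leadership one place to check, rather than a chain of email replies.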

2. Application Performance and Infrastructure Monitoring

Ideally, the team will know there is an issue with an application before the customer does. Application and infrastructure monitoring tools (New Relic for application monitoring and AWS CloudWatch are two examples) can help ensure this is the case. In the midst of an outage, they also provide valuable information about whether a fix or update is working as it should. It is important to monitor both application performance and infrastructure performance and, ideally, to link the two with a solution such as PagerDuty, which consolidates all service health data into a single view and notifies the on-call responder if any issue requires urgent action. It is much easier to troubleshoot an issue and identify the root cause if you have visibility into both layers.
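The idea of consolidating both layers into a single view can be sketched as follows. This is an illustration under stated assumptions, not how any particular monitoring product works: the `Signal` class, the metric names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    layer: str        # "application" or "infrastructure"
    name: str         # hypothetical metric name, e.g. "error_rate"
    value: float      # latest observed value
    threshold: float  # value above which the signal counts as breached

    @property
    def breached(self):
        return self.value > self.threshold


def summarize(signals):
    """Combine signals from both layers into one health summary."""
    breached = [s for s in signals if s.breached]
    return {
        # Page the on-call responder if anything is over its threshold.
        "page_on_call": bool(breached),
        # Knowing which layers are affected helps narrow the root cause.
        "layers_affected": sorted({s.layer for s in breached}),
        "breached": [s.name for s in breached],
    }
```

If only application signals are breached while infrastructure looks healthy, a recent deploy is a more likely suspect than a failing host, which is the kind of inference a combined view makes possible.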

3. Status Updates

When there is a performance issue, support teams will be inundated with requests for updates. Key ways to mitigate this influx are Twitter, a status page, and a stakeholder engagement product such as PagerDuty. These channels are separate from your primary infrastructure and should be resilient to even site-wide outages. On Twitter, users who are having an issue can easily look for pinned tweets and recent replies. Users can also check statusapp.com for any “yellow” or “red” statuses.

An easy-to-read status page like the one from statuspage.io is critical for disseminating information to your customers during an outage. Users will build trust in the page if it is accurate and includes updates for minor disruptions, and in that way they also build more trust in your business. The page should also contain updates while troubleshooting is under way, and include the status of each major subcomponent. These updates should be available within minutes, for complete visibility.

Finally, with capabilities like PagerDuty’s Stakeholder Engagement, any incident responder can easily send out a status update that reaches predefined groups of business stakeholders via their preferred notification channel: phone, SMS, email, or push notification. Stakeholders can also subscribe to incident status updates to get real-time information on any customer-impacting issue.
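A status page that reports each major subcomponent also needs a rule for the overall status. A minimal sketch, assuming a three-level “green”, “yellow”, “red” scheme where the worst component wins (the component names and the `render` layout are hypothetical):

```python
# Higher number means worse; the overall status is the worst component status.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def overall_status(components):
    """Return the worst status among the components, or green if there are none."""
    if not components:
        return "green"
    return max(components.values(), key=SEVERITY.__getitem__)


def render(components):
    """Render a plain-text status summary, one line per subcomponent."""
    lines = [f"Overall: {overall_status(components)}"]
    lines += [f"  {name}: {status}" for name, status in sorted(components.items())]
    return "\n".join(lines)
```

Reporting minor disruptions as “yellow” rather than hiding them is what lets users trust a “green” page when they see one.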

4. Ticketing Solution

A ticketing solution like Zendesk is critical to managing an outage. A significant outage can be highly disruptive and cost substantial goodwill. A ticket management system will help identify intermittent issues that an application monitor may have missed. It will also track and disseminate information related to an influx of support requests. Escalation workflows will raise potential issues more quickly than relying on individual judgment, especially on larger support teams. Ready-made message templates will help keep messaging consistent and accurate during an outage, and “related to” tags will make it easier to debrief an issue once it has been resolved.
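One way an escalation workflow can surface an issue faster than individual judgment is a simple volume rule: escalate any tag that appears on several tickets within a short window. The sketch below assumes tickets arrive as `(opened_at, tag)` pairs; the 15-minute window and threshold of three are arbitrary illustrative values, not recommendations from any ticketing product.

```python
from collections import Counter
from datetime import datetime, timedelta


def tags_to_escalate(tickets, now, window=timedelta(minutes=15), threshold=3):
    """Return tags seen on at least `threshold` tickets opened within `window`.

    tickets is an iterable of (opened_at, tag) pairs; this shape is an
    assumption made for the sketch.
    """
    recent = Counter(
        tag
        for opened_at, tag in tickets
        if timedelta(0) <= now - opened_at <= window
    )
    return sorted(tag for tag, count in recent.items() if count >= threshold)
```

A rule like this can catch an intermittent problem that no single support agent sees often enough to flag.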

5. Procedure Tracking

With proper procedures in place, an organization can anticipate issues that are likely to arise from its applications. These scenarios should be documented ahead of time, with troubleshooting, mitigation, and remediation information surfaced for the team. The procedure can also include a checklist of duties that lays out who does what, along with emergency numbers and who is on call. If resources are available, a tabletop exercise of a mock outage is extremely helpful in identifying gaps before a major outage occurs. After a firefight, debrief with the team in a post-mortem and improve your procedures. There will be another outage, and any additional information you can add to your process will speed recovery. As with the items above, it is possible your local architecture will become unavailable, so storing these procedures in an externally hosted repository, or automating them with a solution such as PagerDuty, is preferred.
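The checklist of duties described above can be kept as structured data so that it is easy to render, review in a tabletop exercise, and update after a post-mortem. This is a minimal sketch; the `Runbook` and `Step` classes and their field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    duty: str          # what needs to be done
    owner: str         # who is responsible for doing it
    done: bool = False


@dataclass
class Runbook:
    scenario: str
    on_call: str
    steps: list = field(default_factory=list)

    def next_step(self):
        """Return the first step that has not been completed, or None."""
        return next((s for s in self.steps if not s.done), None)

    def render(self):
        """Render the checklist as plain text for a chat channel or wiki."""
        lines = [f"{self.scenario} (on-call: {self.on_call})"]
        lines += [
            f"[{'x' if s.done else ' '}] {s.duty}: {s.owner}" for s in self.steps
        ]
        return "\n".join(lines)
```

Keeping the rendered output in an externally hosted location means the checklist is still reachable when the systems it describes are down.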

These tools are only an initial list. They are only as effective in an outage as the time spent configuring and understanding them ahead of time. Communicating with both internal and external stakeholders is key in any firefight, as much within IT as in any other function or industry.

The post 5 Incident Management Tools You Need During a Firefight appeared first on PagerDuty.


PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and GitHub.
