Top Ten Practices of Highly Effective DevOps Incident Management Teams

I recently presented a webinar with DevOps.com about the behaviors we see in teams that represent the leading edge of Incident Management. Using the Incident Management Lifecycle as a jumping-off point, we explored 10 tips that nest into each of the 5 phases of an incident’s lifecycle. Depending on a team’s relative maturity, these ideas may represent anything from a starry-eyed daydream to an example of your normal operating practice.

A recording of the presentation, polls, and Q&A can be viewed here. I’ll explore the topics discussed further below.

Incident Management Lifecycle

As we’ve discussed before, we like to break up conversations around incident management into five phases. This framework gives us a common language to discuss improvements that teams can adopt to make on-call suck less.

Most people immediately recognize the first 3 phases; that’s certainly where all the action and adventure live in on-call. That said, the last two are the most important for a team if reducing time-to-resolution is their focus. Analysis and Readiness are where teams focus on learning and enacting improvements. Without significant, ongoing focus on these phases, teams will rarely manage to get out of a break/fix cycle.

The Zeroth Tip

Iteration. I cannot overstate that the best thing you can do to get better at Incident Management is to keep trying to get better at Incident Management. In most environments, the reality of on-call is a kind of foggy staggering between outages. A large event will occur, after which teams will scurry around trying to implement changes to Detection, Response, Remediation, and systems all at once. A harried two weeks will come to a close with many ideas half implemented, then everyone returns to their day job of programming or systems. No further work is done to make things better until the next big outage.

This is not a recipe for success.

Every iteration, every week, some percentage of the roadmap must be devoted to ongoing improvements to the systems, process, and people behind your on-call rotations.

Detection

Tip #1 Take a Blended Approach to Detection

Simple, static monitoring never provides a sufficient picture of system or application health. The best teams have a mature blend of Static, Synthetic, APM, Time Series, and Log Analysis systems in place.

While a blended approach gives far more insight and intelligence about what’s going on, teams must be wary of alert fatigue from too many things going beep.
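One way to keep a blended setup from going beep too often is to suppress repeats of the same symptom, whichever monitoring system reports it. The following is a minimal sketch of that idea, not a description of any particular product: the `AlertDeduplicator` class, the `(service, symptom)` key, and the five-minute window are all assumptions made for illustration.

```python
class AlertDeduplicator:
    """Suppress repeat notifications for the same symptom within a time window.

    Hypothetical helper: alerts from different monitoring sources are treated
    as the same problem when they share a (service, symptom) key.
    """

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self._last_notified = {}  # (service, symptom) -> time of last notification

    def should_notify(self, service, symptom, now):
        """Return True if a human should be notified at time `now` (in seconds)."""
        key = (service, symptom)
        last = self._last_notified.get(key)
        if last is not None and now - last < self.window:
            return False  # already notified recently; stay quiet
        self._last_notified[key] = now
        return True


dedup = AlertDeduplicator(window_seconds=300)
print(dedup.should_notify("checkout", "high latency", now=0.0))    # first report
print(dedup.should_notify("checkout", "high latency", now=120.0))  # repeat inside the window
print(dedup.should_notify("checkout", "high latency", now=400.0))  # window has passed
```

The window is measured from the last notification that was actually sent, so a symptom that keeps firing still reaches a human once per window instead of being silenced forever.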

Tip #2 Focus on Business Outcomes

While the aggregate network throughput on the outside interface of your load balancer is an interesting metric, it means little compared to midday revenue. Why has engineering moved so far away from business outcomes? The best teams use the business’s core metrics as a means to detect application health, and respond in kind to changes or variance beyond expected norms.
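As a rough illustration of "variance beyond expected norms", a business metric can be compared against its own history for the same time of day. This is only a sketch under stated assumptions: the `is_anomalous` function, the three-standard-deviation threshold, and the revenue figures are invented for the example, and real revenue data usually needs seasonality handling that this ignores.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Return True when `current` is more than `threshold` standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate a norm
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # any change from a perfectly flat history is notable
    return abs(current - mu) / sigma > threshold


# Hypothetical midday revenue (in dollars) for the same hour on previous days.
midday_revenue = [10200, 9800, 10050, 9900, 10100, 9950]

print(is_anomalous(midday_revenue, 10020))  # a normal-looking day
print(is_anomalous(midday_revenue, 6200))   # revenue has fallen sharply
```

A check like this says nothing about why revenue moved, which is why it works best alongside the system-level signals from Tip #1.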

This expanded view of application health encourages Incident Management teams to consider the multitude of inputs (social media, NPM, etc.) available to them. The best Incident Management teams are multidisciplinary; they work closely with many organizational elements in the business.

Response

Tip #3 Keep alerts actionable

So much has been said about this, you’d think we could all move on. However, the number one thing I still hear in casual conversations is how alert storms and unactionable nonsense are ruining a team’s morale.

Actionability means both that the alert requires action (instead of being purely informational), and that the alert has been delivered to someone with the permission and ability to perform said action.
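Those two conditions can be written down as a routing rule. The sketch below is hypothetical throughout: the `Alert` and `Responder` types, the `can_act_on` field, and the three outcomes ("log", "page", "escalate") are assumptions chosen to illustrate the definition, not a real alerting API.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    service: str
    requires_action: bool  # False means purely informational


@dataclass
class Responder:
    name: str
    # Services this person has both the permission and the ability to change.
    can_act_on: set = field(default_factory=set)


def route(alert, on_call):
    """Decide what to do with an alert: log it, page someone, or escalate."""
    if not alert.requires_action:
        return ("log", None)  # informational: record it, wake nobody
    for responder in on_call:
        if alert.service in responder.can_act_on:
            return ("page", responder.name)
    # Action is needed but nobody on call can take it.
    return ("escalate", None)


on_call = [Responder("alex", {"checkout"}), Responder("sam", {"search"})]
print(route(Alert("disk at 60%", "checkout", requires_action=False), on_call))
print(route(Alert("checkout errors", "checkout", requires_action=True), on_call))
print(route(Alert("billing queue stuck", "billing", requires_action=True), on_call))
```

The "escalate" branch is the interesting one: an alert that needs action but reaches nobody who can take it is a gap in the rotation, not just a noisy page.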

Tip #4 Start or grow your ChatOps practice

At any level of adoption, ChatOps is a game changer for teams. Particularly as teams move to fully integrated ChatOps environments, the benefits to all phases of the lifecycle are manifest. At the most basic level of adoption, ChatOps creates a common, time-indexed and searchable record of the firefight. These transcripts are useful for getting others who join the action up to speed, and for after-action discussions in the Analysis phase.
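To make "time-indexed and searchable" concrete, here is a small in-memory sketch of what a chat transcript gives you for free. The `IncidentTimeline` class and its methods are hypothetical; a real chat tool stores and indexes messages for you, so this only illustrates the shape of the record.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Entry:
    at: datetime
    author: str
    text: str


class IncidentTimeline:
    """Hypothetical stand-in for the transcript a chat channel keeps."""

    def __init__(self):
        self._entries = []

    def record(self, author, text, at=None):
        """Append a message, stamped with the current UTC time by default."""
        entry = Entry(at or datetime.now(timezone.utc), author, text)
        self._entries.append(entry)
        return entry

    def search(self, term):
        """Case-insensitive substring search over message text."""
        term = term.lower()
        return [e for e in self._entries if term in e.text.lower()]

    def since(self, moment):
        """Everything said at or after `moment`, for someone joining late."""
        return [e for e in self._entries if e.at >= moment]


timeline = IncidentTimeline()
t0 = datetime(2017, 6, 12, 14, 0, tzinfo=timezone.utc)
t1 = datetime(2017, 6, 12, 14, 5, tzinfo=timezone.utc)
timeline.record("alex", "Checkout error rate is climbing", at=t0)
timeline.record("sam", "Rolling back the last deploy", at=t1)
print([e.author for e in timeline.search("deploy")])
print(len(timeline.since(t1)))
```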

An interesting poll result from the webinar shows that, while adoption is growing, ChatOps remains an opportunity for a lot of teams. 25% of respondents indicated no use of ChatOps in their teams.

Remediation

Tip #5 Runbooks are central to remediation

As engineering teams continue to adopt complex technology systems, helping incident responders understand the what of a system becomes a critical area of focus. Getting paged for the Nth microservice created this week is great… if it’s your service. Cross training for every application is hardly practical, so the best teams rely on runbooks as the bridge to establish context for on-call teams.

The best runbooks have several characteristics:

-clearly explain metrics and alerts
-clearly explain application or system role
-identify upstream and downstream dependencies
-identify an escalation point or Subject Matter Expert
-enumerate known failure states or symptoms
-list (through integration) recent work or incidents
-are routinely updated
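The characteristics above can be treated as a checklist that a team runs against each runbook. What follows is a sketch only: the `Runbook` fields, the `gaps` function, and the 90-day review interval are assumptions made for illustration, and recent work or incidents are left unchecked here because the list expects them to arrive through integration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Runbook:
    service: str
    role: str = ""                                            # what the application or system is for
    alert_explanations: dict = field(default_factory=dict)    # alert or metric name -> meaning
    upstream: list = field(default_factory=list)
    downstream: list = field(default_factory=list)
    escalation_contact: str = ""                              # escalation point or Subject Matter Expert
    known_failure_states: list = field(default_factory=list)
    recent_incidents: list = field(default_factory=list)      # filled in by integration
    last_reviewed: date = date.min


def gaps(runbook, today, max_age_days=90):
    """Return a list of the characteristics this runbook is missing."""
    problems = []
    if not runbook.alert_explanations:
        problems.append("no metrics or alerts explained")
    if not runbook.role:
        problems.append("no description of the system's role")
    if not (runbook.upstream or runbook.downstream):
        problems.append("no dependencies identified")
    if not runbook.escalation_contact:
        problems.append("no escalation point")
    if not runbook.known_failure_states:
        problems.append("no known failure states")
    if today - runbook.last_reviewed > timedelta(days=max_age_days):
        problems.append("not reviewed recently")
    return problems


draft = Runbook(service="checkout", role="Takes payment for orders")
for problem in gaps(draft, today=date(2017, 6, 12)):
    print(problem)
```

Running a check like this on a schedule is one inexpensive way to honor the last item on the list, since a stale runbook can mislead a responder as badly as a missing one.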

Tip #6 Adopt infrastructure-as-code

It’s tough to call this a tip given the work involved for any team to move from older methods of maintaining infrastructure. Adopting configuration or infrastructure as code represents a multi-year project for nearly any established business.

The difficulty of adopting this approach aside, teams who operate infrastructure in the same workstream as developing code are a breed apart. Consolidated workstreams create a step function in efficiency and adaptability for any DevOps team dealing with outages. Full transparency into all state changes in systems leads to quicker diagnosis, and the ease with which these teams actualize changes is doubly impactful to remediation efforts.

Analysis

Tip #7 Data drives investigation

If, like me, you’re tired of hearing about “data driven” things, you can mentally rewrite this tip title to read “Rigorous methodology drives investigation”. The best teams keep after-action analysis focused on clean observation, testable hypotheses, clear success criteria, and an iterative approach to learning. At their best, these approaches look more like the Scientific Method than anything else.

Moreover, these teams discuss the impact of cognitive biases on their work. Objectivity, rigor, and defensible analysis rule the day.

Tip #8 Keep postmortems blameless

Much like #3 above, this advice almost seems like piling on – who has been at a conference in the past year where a “blameless postmortem” talk wasn’t given? The number of people advocating it, though, is an indication of how utterly required it is for any on-call team. Blameful culture cripples responders, as they are less likely to act in the moment and more likely to hide information after the fact.

The best teams create a culture where responders are empowered to act, expected to act, and rewarded for taking smart action. A postmortem focused on learning as much as possible from an event is one way to help foster a culture of action in your team.

Readiness

Tip #9 Keep postmortems actionable

The most objective, learning-focused, rigorous Analysis phase imaginable is worth little if no action is taken. Far too often teams go through the motions of after-action discussion, then lose track of the ideas, lose time to implement, or lose focus on those ideas through lack of leadership.

Improvements to Incident Management are more important than, less important than, or as important as feature requests. Who knows which? Keeping Product Managers involved in your readiness will enable clear discussions about the tradeoffs with other work being planned. Adding those improvements to the same workstreams will not ensure they are actioned, but it does help you create a running backlog of desired improvements that can be added to longer term roadmap planning.

Tip #10 Organize the swarm

One of the best questions from the Q&A time at this event was (paraphrased) “How can you help a team deal with the unknown or unexpected?”. This is the heart of really everything we do. Things are going to break. They are probably going to break badly. They will break in a way you have only vaguely imagined… how can we be ready for the unknown?

The short answer is you can’t! What you can (and should) do instead is focus on setting a group of smart people up for success. Give them tools, give them clear roles and organization, and let them be smart. They’ll figure it out, as long as they aren’t also trying to figure out organization and communication at the same time.

Keep Iterating

Any one of these ideas may represent months of work for a team. None of them are going to solve all your on-call problems the first time you try them. By adopting a continuous improvement mindset, Incident Management teams can implement small changes frequently, and start walking the path to being a highly effective team. These ideas and more are explored in the new ebook The Dev and Ops Guide to Incident Management, which you can download here.

The post Top Ten Practices of Highly Effective DevOps Incident Management Teams appeared first on VictorOps.

More Stories By VictorOps Blog

VictorOps is making on-call suck less with the only collaborative alert management platform on the market.

With easy on-call scheduling management, a real-time incident timeline that gives you contextual relevance around your alerts and powerful reporting features that make post-mortems more effective, VictorOps helps your IT/DevOps team solve problems faster.