
5 Ways to Improve Team Health With Effective On-Call Handoffs

“You code it, you own it” means engineers are called when the software and systems they’ve built fail in production, and it’s their responsibility to get everything working again. However, managers and business stakeholders aren’t usually on-call, so they don’t see or feel the pain of being paged. This can lead to work prioritization decisions that lack empathy and fail to take into account the responsibility we all have for operational resiliency. Managers push for delivery of new features and higher output over work that addresses operational pain. The engineers see problems and feel powerless to solve them. Over time this conflict results in expensive outages that hurt the team, the business, and customers.

Small issues are usually an early warning sign of more serious problems. If they’re fixed as soon as they arise, bigger problems can be avoided in the long run and your team and customers stay happy.

So, how do we get proactive and make fixing operational problems a habit? Empowering the team with effective on-call handoff sessions is a great place to start!

When our on-call team members go off duty and hand the baton to their teammates, we use this time to expose operational problems, discuss solutions, and empower the team to initiate action. Here are a few tips for effective on-call handoff sessions based on my experience of being on-call at a number of companies, including PagerDuty.

1. Make On-Call Handoffs a Ritual

It’s easy to miss problems engineers are facing when they’re on-call if the team only talks about operational problems in engineering chat rooms. We have regular, dedicated handoff sessions to encourage reflection and create a bias for proactive action to address root causes. Our schedules usually change once a week, so we hold the meeting on the day of the changeover.

2. Increase Empathy by Inviting Non-Engineers

Being on-call and waking up to incidents can be disruptive and stressful. We include other stakeholders in the on-call handoff meeting to build a sense of camaraderie and empathy, which ultimately leads to better decision making across the organization.

Our product managers benefit from understanding the impact of operational pain on engineers and customers. Exposure to handoff sessions allows PMs to hear the impact of their prioritization decisions and ensure both product and technical initiatives are moved forward during work planning sessions.

The goal of engineering leaders is to foster a team culture where individuals are happy, motivated, creative, and engaged. By observing on-call handoff sessions and carefully listening to concerns, people managers get exposure to insights that may not be uncovered in team or one-on-one meetings. Following the session, leaders can take action to provide support and resources. Two examples are encouraging engineers to take well-deserved time off and helping prioritize the team’s technical and operational recommendations.

3. Embrace Observability by Reviewing Metrics During the Handoff

It’s easy for teams to get accustomed to disruption when it builds up gradually over time, especially if no one is taking a holistic view and noticing worrying patterns. Reviewing metrics during the handoff session promotes a culture of observability that allows the team to see the true picture of operational health, both infrastructural and human.

Here are metrics and tools we’ve found useful during our handoff sessions:

Team disruption statistics: PagerDuty provides valuable data and graphs showing total incidents by service, team, and user. Comparing counts at each review allows us to reflect on patterns and discuss solutions.

Chat history: By using a chat integration (Slack, HipChat, etc.), all incident notifications can be sent to a dedicated channel. Our engineers chat in the same channel as the incident notifications, so it’s easy to identify and analyze conversation threads showing trending topics and concerns.

Use PagerDuty’s Public APIs to create custom reports and apps: PagerDuty’s APIs let you build reports and apps tailored to your business. For example, we’ve created an extension that gives an instant picture of how much out-of-hours disruption the on-call team members have had, based on the time of day and frequency of high-priority incidents. By sharing this view across the team in the handoff session, we see a picture of team health that motivates us to take action.
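To make the out-of-hours idea concrete, here is a minimal Python sketch of how such a count could be computed. It is not the extension described above, and it makes no API calls. It assumes you have already exported incident records into dictionaries with three fields: "responder", "urgency", and "created_at" (an ISO 8601 timestamp). The field names, the 09:00 to 18:00 weekday working window, and the sample records are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime

# Assumption: working hours are 09:00-18:00, Monday to Friday, in the
# timezone of each timestamp. Adjust these for your own team.
WORK_START_HOUR = 9
WORK_END_HOUR = 18


def is_out_of_hours(ts: datetime) -> bool:
    """Return True for weekends and for times outside the working window."""
    if ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (WORK_START_HOUR <= ts.hour < WORK_END_HOUR)


def out_of_hours_disruption(incidents: list) -> Counter:
    """Count high-urgency, out-of-hours incidents per responder."""
    counts = Counter()
    for incident in incidents:
        if incident["urgency"] != "high":
            continue  # only high-urgency incidents count as disruption here
        created = datetime.fromisoformat(incident["created_at"])
        if is_out_of_hours(created):
            counts[incident["responder"]] += 1
    return counts


# Hypothetical sample records for illustration only, not real data.
sample_incidents = [
    {"responder": "alice", "urgency": "high", "created_at": "2018-03-05T02:14:00"},
    {"responder": "alice", "urgency": "high", "created_at": "2018-03-05T11:30:00"},
    {"responder": "bob", "urgency": "low", "created_at": "2018-03-06T23:05:00"},
    {"responder": "bob", "urgency": "high", "created_at": "2018-03-10T15:00:00"},
    {"responder": "alice", "urgency": "high", "created_at": "2018-03-08T19:45:00"},
]

disruption = out_of_hours_disruption(sample_incidents)
for responder, count in disruption.most_common():
    print(f"{responder}: {count} out-of-hours high-urgency incident(s)")
```

A per-person count like this is deliberately crude. In a handoff session it is a conversation starter about who needs recovery time, not a performance measure.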

4. Take Action to Improve

Create tasks, experiment, review, and adjust

Areas of concern that are uncovered during the on-call handoff sessions must be followed up with concrete actions. PagerDuty’s Jira integration makes it easy to quickly track unplanned work from right inside an incident. It’s then just a short step to assign this work to the on-call engineer (see the next section, “Reinforce Expectations for On-Call Best Practice,” to understand how this works).

When each proposed improvement is written down and tied to a concrete action, it’s much more likely to happen.

Remember to review the result of changes in subsequent on-call handoff sessions and adjust your approach based on what was learned.

5. Reinforce Expectations for On-Call Best Practice

Many teams fall into the trap of failing to set clear expectations of on-call and see it as just ‘part of the job’ rather than a dedicated, critical role. How can you stay out of this trap? We set clear expectations:

  1. During their on-call shift, engineers will dedicate time to investigating and fixing the root cause of operational problems as a priority.
  2. Picking up new feature work should be a luxury, not an expectation.
  3. After a disruptive night or weekend, on-call engineers are expected to take a break and have time to recover.

At the on-call handoff session, it’s important to check in on these expectations and reinforce the message that operational improvement requires effort. Humans need time and space to be able to focus on it. They also need downtime and a sustainable workload.

For more advice on best practice for being on-call, check out our On-Call Survival Guide.

Having engineers on-call is an effective way to encourage continuous improvement and system stability. However, it only works if everyone in the organization understands how to play their part in making it successful. Even if you are not an engineer, your decisions are likely to have unintended side effects on the well-being of engineers and the systems they’re building. Getting involved in on-call handoff sessions and encouraging proactive resolution of problems leads to happy teams and successful products. I encourage you to look at your own organization and reflect on ways you can build empathy across teams using similar techniques. Share your ideas and suggestions in our Community forum!

The post 5 Ways to Improve Team Health With Effective On-Call Handoffs appeared first on PagerDuty.


More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and GitHub.
