Better Prep Your On-Call Engineer

The on-call engineer has a critical role to play in incident management. They can mean the difference between an incident turning critical or being managed and resolved quickly.

Startups may not have many choices about who should be on call, but as an organization grows and incident management becomes more complex and higher-stakes, it’s important to have a structured process for the on-call engineer. Whether you’re a startup or an enterprise, you can benefit from a clear process that equips your on-call engineer to succeed. Here are a few guidelines.

First response is critical

In the first few minutes of an incident, the on-call engineer needs to know its severity and service impact. Based on that, he or she needs to gauge which downstream services have been affected, who is needed to resolve the incident, and how to bring them in quickly. This requires a working knowledge of how the system functions, so that when something breaks, the engineer can identify the likely root cause and decide what to work on first.

The on-call rotation should be scheduled automatically. This way, the load is shared, the team optimizes for fairness and accountability, and everyone stays in practice handling incidents. Larger teams may have dedicated incident managers who can initiate the first response. In either case, the primary goal of the on-call engineer is to loop in the resources needed to resolve an incident if they can’t troubleshoot and fix it themselves.
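As a rough illustration of automatic scheduling, here is a minimal Python sketch of a weekly round-robin rotation. The function name, the engineer names, and the one-week shift length are assumptions made for the example, not part of any particular scheduling tool.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(engineers, start, weeks):
    """Assign one engineer per week, round-robin, beginning on `start`.

    Returns a list of (shift_start, shift_end, engineer) tuples.
    The weekly shift length is an assumption for this sketch.
    """
    if not engineers:
        raise ValueError("need at least one engineer")
    shifts = []
    for i, name in enumerate(islice(cycle(engineers), weeks)):
        shift_start = start + timedelta(weeks=i)
        shift_end = shift_start + timedelta(days=6)
        shifts.append((shift_start, shift_end, name))
    return shifts


if __name__ == "__main__":
    # Hypothetical team of three, four weeks of shifts.
    for begin, end, name in build_rotation(["ana", "bo", "chen"], date(2017, 6, 5), 4):
        print(f"{begin} to {end}: {name}")
```

Because the list simply cycles, each engineer comes up equally often, which is the fairness property described above.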

Have a secondary on-call engineer

You should have a secondary (and probably even a tertiary) on-call engineer as backup. This ensures that nothing falls through the cracks should the first-level responder sleep through the 3 a.m. page. It also means the team needs a rotation schedule for each of these roles. Set up automated rules so that the incident notification is escalated to the backup engineer if there’s no response from the primary engineer.
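The escalation rule can be sketched as a small lookup: given how long a page has gone unacknowledged, decide who should be paged. This is an illustrative sketch only. The class, the function, and the timeout values are hypothetical and do not represent the PagerDuty API.

```python
from dataclasses import dataclass


@dataclass
class EscalationLevel:
    responder: str
    timeout_minutes: int  # how long to wait for an acknowledgement


def responder_to_page(levels, minutes_unacknowledged):
    """Return who should be paged after the given minutes without an ack."""
    if not levels:
        raise ValueError("escalation policy has no levels")
    elapsed = 0
    for level in levels:
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.responder
    # Every level timed out: keep paging the last responder in the chain.
    return levels[-1].responder


# Hypothetical policy: timeouts are assumptions for the example.
POLICY = [
    EscalationLevel("primary", 5),
    EscalationLevel("secondary", 10),
    EscalationLevel("tertiary", 15),
]
```

With this policy, a page that has gone unanswered for five minutes moves from the primary to the secondary engineer, and after fifteen minutes in total it moves to the tertiary.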

Ensure your on-call engineer has the required training

Since there’s a lot at stake when an incident occurs, your on-call engineer needs to be able to follow protocol as well as think on the go. He or she needs to understand how to get in touch with different cross-functional stakeholders (from customer support, marketing, PR, etc.) so that remediation status can be communicated externally in an appropriate manner. It is also useful to hand the on-call engineer a checklist or flowchart to follow when incidents occur.

As every minute of downtime can mean thousands of dollars lost, here are the steps an on-call engineer needs to take during an incident as quickly as possible:

Identify & Log

The first step is to identify or detect the incident and start logging it. Logging helps you get to the root cause of the issue quickly and provides context for a comprehensive post-mortem once the incident is resolved. Since it’s important to respond quickly, identifying and logging must also be done quickly and methodically so you can move on to the next step.
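One lightweight way to log methodically is to keep a timestamped list of notes that can later be replayed in order for the post-mortem. The sketch below assumes a simple in-memory structure. The class and method names are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentLog:
    title: str
    entries: list = field(default_factory=list)

    def note(self, message, when=None):
        """Record a message, timestamped in UTC unless `when` is given."""
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, message))

    def timeline(self):
        """Return the notes as text lines in chronological order."""
        return [f"{when.isoformat()} {message}" for when, message in sorted(self.entries)]


if __name__ == "__main__":
    log = IncidentLog("checkout errors")  # hypothetical incident
    log.note("alert received: elevated 5xx rate")
    log.note("acknowledged, checking recent deploys")
    print("\n".join(log.timeline()))
```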

Categorize & Prioritize

Due to the vast variety of problems that a team can encounter, it is important to categorize incidents to prevent confusion. Note the number of users affected, the “blast radius” of the issue with respect to affected services, the potential revenue impact, and so on. Prioritizing incidents can help the on-call engineer make a call on whether the incident requires the time and resources of the rest of the team. Minor, less complex incidents should be handled by the engineer alone if possible to save the entire team’s time. Non-actionable alerts should also be suppressed, to further ensure that on-call engineers can focus on what matters.
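The triage decision described here can be written down as a small rule. The following sketch is one possible shape for it. The field names, the three outcomes, and every threshold are assumptions chosen for the example, and a real team would tune them to its own services.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    users_affected: int
    services_affected: int   # the "blast radius"
    revenue_at_risk: float   # estimated loss per hour
    actionable: bool = True


def prioritize(incident):
    """Return 'suppress', 'handle-alone', or 'page-team'.

    Thresholds are illustrative assumptions, not recommendations.
    """
    if not incident.actionable:
        return "suppress"
    if (
        incident.users_affected >= 1000
        or incident.services_affected >= 3
        or incident.revenue_at_risk >= 10_000
    ):
        return "page-team"
    return "handle-alone"
```

Keeping the rule this explicit makes it easy to review after an incident and adjust if the engineer was paged for something minor, or not paged for something serious.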

Notify the Right People

Using a platform like PagerDuty, with its built-in ChatOps and collaboration integrations, is a best practice for recruiting the relevant people and bringing them together in the right place at the right time. In particular, using specific ChatOps channels or rooms, shared video calls and conferencing, and fixing issues in context can make a big difference in the speed of resolution and the level of business impact. When communicating with team members, it’s also important to describe the incident concisely to save both yourself and others time. Teams can get distracted by alert overload, and a solution like PagerDuty is imperative to suppress the noise and surface the signal.


Troubleshooting doesn’t have to wait until the whole team is notified and present. Even while waiting for responses, it is vital that first responders like the on-call engineer be able to troubleshoot on the go. Rapid responses can be a lifesaver, much as with real-life emergency services, where the first few minutes are incredibly important.

Managing and equipping on-call resources is crucial to the success of any development or operations team. Having sufficient backups and well-thought-out processes and plans in place ensures efficiency when things go south. If on-call engineers follow the basic steps outlined above, teams can spend more time creating and innovating, and less time fixing.

The post Better Prep Your On-Call Engineer appeared first on PagerDuty.

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and Github.
