
Better Prep Your On-Call Engineer

The on-call engineer plays a critical role in incident management. They can mean the difference between an incident turning critical and one that is managed and resolved quickly.

Startups may not have many choices about who should be on call, but as the organization grows, incident management becomes more complex, and the stakes get higher, it’s important to have a structured process for the on-call engineer. Whether you’re a startup or an enterprise, you can benefit from having a clear process for equipping your on-call engineer to succeed. Here are a few guidelines.

First response is critical

In the first few minutes of an incident, the on-call engineer needs to know its severity and service impact. Based on that, they need to gauge which downstream services have been affected, who is needed to resolve the incident, and how to bring those people in quickly. This requires a working knowledge of how the system functions, so that when something breaks, they can identify the root cause and decide what to work on first. The on-call rotation should be scheduled automatically. This way, the load is shared, the team optimizes for fairness and accountability, and everyone handles incidents often enough to stay in practice. Larger teams may have dedicated incident managers who initiate the first response. In either case, the primary goal of the on-call engineer is to loop in the resources needed to resolve an incident if they can’t troubleshoot and fix it themselves.
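As a rough illustration of an automatically scheduled rotation, here is a minimal round-robin sketch. The function name `build_rotation`, the engineer names, the weekly shift length, and the start date are all assumptions made for the example; a real scheduling tool would also handle time zones, holidays, and shift swaps.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Assign one engineer per week, cycling through the list in order.

    Hypothetical helper: weekly shifts and simple round-robin are
    assumptions for illustration, not a prescribed policy.
    """
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        # The modulo wraps back to the first engineer once the list is used up.
        schedule.append((shift_start, engineers[week % len(engineers)]))
    return schedule


rotation = build_rotation(["alice", "bob", "carol"], date(2017, 7, 3), 4)
for shift_start, engineer in rotation:
    print(shift_start.isoformat(), engineer)
```

Because the assignment simply cycles through the list, each engineer takes a shift before anyone takes a second one, which is one simple way to share the load evenly.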

Have a secondary on-call engineer

You should have a secondary (and probably even a tertiary) on-call engineer as backup. This ensures that nothing falls through the cracks should the first-level responder sleep through the 3 a.m. page. This also means that there needs to be a schedule for rotating roles within the team. Set up automated rules so that the incident notification gets escalated to the backup engineer if there’s no response from the primary engineer.
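The escalation rule described above might look something like the following sketch. The function `who_to_page`, the five-minute timeout, and the three-level chain are hypothetical choices for illustration and do not represent any particular product's configuration.

```python
def who_to_page(escalation_chain, minutes_unacknowledged, timeout_minutes=5):
    """Return the responder who should be paged right now.

    Each level gets `timeout_minutes` to acknowledge before the page
    moves to the next level; the last level keeps the page after that.
    The timeout value is an assumed default, not a recommendation.
    """
    level = minutes_unacknowledged // timeout_minutes
    # Never run past the end of the chain.
    level = min(level, len(escalation_chain) - 1)
    return escalation_chain[level]


chain = ["primary", "secondary", "tertiary"]
for minutes in (0, 7, 12, 60):
    print(minutes, who_to_page(chain, minutes))
```

Keeping the page on the final level, rather than cycling back to the primary, is a design choice; some teams prefer to loop back or notify a manager instead.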

Ensure your on-call engineer has the required training

Since there’s a lot at stake when an incident occurs, your on-call engineer needs to be able to follow protocol as well as think on their feet. They need to understand how to get in touch with different cross-functional stakeholders (from customer support, marketing, PR, etc.) so that remediation status can be communicated externally in an appropriate manner. It is also useful to hand the on-call engineer a checklist or flowchart to follow when incidents occur.

As every minute of downtime can mean thousands of dollars lost, here are the steps an on-call engineer needs to take during an incident as quickly as possible:

Identify & Log

The first step is to identify or detect the incident and log it. Logging helps you get to the root cause of the issue quickly and provides context for a comprehensive post-mortem once the incident is resolved. Since it’s important to respond to the incident quickly, identifying and logging must also be done quickly and methodically in order to move on to the next step.
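One way to make logging quick and methodical is to record each observation as a timestamped, structured entry that can later be replayed in the post-mortem. The sketch below is only an illustration; the `log_entry` helper, the field names, and the incident ID format are assumptions.

```python
import json
from datetime import datetime, timezone


def log_entry(incident_id, message, when=None):
    """Build one JSON timeline entry for an incident.

    Hypothetical helper: the field names are illustrative. `when`
    defaults to the current UTC time so entries sort consistently
    across responders in different time zones.
    """
    when = when or datetime.now(timezone.utc)
    record = {"incident": incident_id, "time": when.isoformat(), "message": message}
    return json.dumps(record, sort_keys=True)


detected = datetime(2017, 7, 3, 3, 0, tzinfo=timezone.utc)
entry = log_entry("INC-1", "Checkout latency alert fired", detected)
print(entry)
```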

Categorize & Prioritize

Because a team can encounter a wide variety of problems, it is important to categorize incidents to prevent confusion. Note the number of users affected, the “blast radius” of the issue with respect to affected services, the potential revenue impact, and so on. Prioritizing incidents helps the on-call engineer decide whether the incident requires the time and resources of the rest of the team. Minor, less complex incidents should be handled by the engineer alone if possible, to save the entire team’s time. Non-actionable alerts should also be suppressed, so that on-call engineers can focus on what matters.
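The triage decision described above can be sketched as a small function. Everything here is an assumption made for illustration: the `Incident` fields, the thresholds of 1,000 users and three services, and the three outcome labels. Each team would set its own criteria.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    users_affected: int
    services_affected: int  # a rough stand-in for the "blast radius"
    revenue_at_risk: bool
    actionable: bool = True


def prioritize(incident):
    """Decide how to route an incident. Thresholds are illustrative only."""
    if not incident.actionable:
        # Non-actionable alerts are suppressed so they do not page anyone.
        return "suppress"
    if (incident.revenue_at_risk
            or incident.users_affected >= 1000
            or incident.services_affected >= 3):
        return "page-team"
    # Minor incidents stay with the on-call engineer.
    return "handle-alone"


minor = Incident("Slow admin page", users_affected=12, services_affected=1, revenue_at_risk=False)
major = Incident("Checkout errors", users_affected=5000, services_affected=4, revenue_at_risk=True)
noise = Incident("Disk at 61%", users_affected=0, services_affected=1, revenue_at_risk=False, actionable=False)
for incident in (minor, major, noise):
    print(incident.title, "->", prioritize(incident))
```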

Notify the Right People

Platforms like PagerDuty, with built-in ChatOps and collaboration integrations, help recruit the relevant people and bring them together in the right place at the right time. In particular, using specific ChatOps channels or rooms, shared video calls and conferencing, and fixing issues in context can make a big difference in the speed of resolution and the level of business impact. While communicating with team members, it’s also important to describe the incident concisely, to save both yourself and others time. Teams can get distracted by alert overload, and a solution like PagerDuty is important for suppressing the noise and surfacing the signal.


Troubleshooting doesn’t have to wait until the whole team is notified and present. Even while waiting for responses, it is vital that first responders like the on-call engineer be able to troubleshoot on the go. Rapid responses can be a lifesaver, much as with real-life emergency services, where the first few minutes are incredibly important.

Managing and equipping on-call resources is a crucial task for any development or operations team to be successful. Having sufficient backups and well-thought-out processes and plans in place ensures efficiency when things go south. If on-call engineers follow the basic steps outlined above, teams can spend more time creating and innovating, and less time fixing.

The post Better Prep Your On-Call Engineer appeared first on PagerDuty.


More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and Github.
