ChaosCat: Automating Fault Injection at PagerDuty

“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” — Principles of Chaos Engineering

Netflix, Dropbox, and Twilio are all examples of companies that perform this kind of engineering. It’s essential to have confidence in large, robust, distributed systems. At PagerDuty, we’ve been performing controlled fault injection into our production infrastructure for several years. As time has passed, and our infrastructure has grown, our Chaos Engineering practices have evolved as well. One somewhat recent addition is an automated fault injector, which we call ChaosCat.

Background

In the beginning, the SRE team at PagerDuty deliberately chose to inject failures into our infrastructure manually, by SSHing into hosts and executing commands one host at a time. This gave us precise control over each fault, let us quickly learn from and investigate issues that arose, and avoided heavy upfront investment in tooling. It worked well for a while and allowed us to build up a library of well-understood, repeatable chaos attacks such as high network latency, high CPU usage, and host restarts.

We knew doing things manually wouldn't scale, so as time went on we began to automate portions of the process: first the individual commands were turned into scripts, then the scripts were dispatched to hosts automatically instead of over manual SSH sessions, and so on. Once individual teams started to own their own services at PagerDuty, this tooling enabled them to perform their own fault injection without needing a central SRE team.

However, early on we had chosen to let individual service owners know ahead of time when faults would be injected. Every Friday, those owners would be at least somewhat aware of what to look for, which gave them a head start on fixing any problems.

The real world rarely gives advance notice of failure, so we wanted to introduce an element of chance by allowing a subset of attacks to be performed at random against any host. We started adding tooling to pick random hosts and run chaos attacks on them; the last piece of the puzzle was putting it all together on an automated schedule. Enter ChaosCat.

Implementation

ChaosCat is a Scala-based Slack chat bot. It builds on top of several other components of our infrastructure, such as our distributed task execution engine. It’s heavily inspired by Chaos Monkey, but more service-implementation-agnostic, as we have a variety of service types in our infrastructure.

First, it runs as an always-on service, so any authorized team can use it for one-off runs (@chaoscat run-once) at any time. During idle periods, it checks a schedule every minute: we only want randomized failures injected during a subset of business hours, when on-call engineers are certain to be awake and ready.
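
As a rough illustration, that schedule gate could look something like the sketch below. The time zone, hours, and object names are assumptions made for the example, not ChaosCat's actual configuration.

```scala
import java.time.{DayOfWeek, LocalTime, ZoneId, ZonedDateTime}

// Hypothetical schedule gate: randomized attacks are only allowed inside a
// weekday window. The zone and hours below are illustrative placeholders,
// not PagerDuty's actual schedule.
object ChaosSchedule {
  private val zone        = ZoneId.of("America/Toronto") // assumed zone
  private val windowStart = LocalTime.of(10, 0)          // assumed start
  private val windowEnd   = LocalTime.of(16, 0)          // assumed end
  private val weekdays: Set[DayOfWeek] = Set(
    DayOfWeek.MONDAY, DayOfWeek.TUESDAY, DayOfWeek.WEDNESDAY,
    DayOfWeek.THURSDAY, DayOfWeek.FRIDAY
  )

  /** True only when randomized chaos attacks may run. */
  def inChaosWindow(now: ZonedDateTime = ZonedDateTime.now(zone)): Boolean =
    weekdays.contains(now.getDayOfWeek) &&
      !now.toLocalTime.isBefore(windowStart) &&
      now.toLocalTime.isBefore(windowEnd)
}
```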

Second, during business hours, it checks whether the system status is all-clear. We don't want to inject a failure if the overall health of our service isn't 100%.
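
A minimal sketch of that gate, assuming a hypothetical StatusClient that exposes per-service health (the real bot consults our internal status tooling, not this interface):

```scala
// Hypothetical status client: modeled here as a map of service name -> healthy?
trait StatusClient {
  def serviceStatuses(): Map[String, Boolean]
}

object HealthGate {
  /** Only allow an attack when every monitored service reports healthy. */
  def allClear(client: StatusClient): Boolean =
    client.serviceStatuses().values.forall(identity)
}
```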

Third, it fires off a randomly chosen chaos attack (different attacks have different selection probabilities) against a random host within our infrastructure; there are no exemptions, because all hosts are equally vulnerable to these issues in the real world. It sends a task to run the chaos attack via Blender, the distributed task execution framework mentioned above, using our in-house job runner.
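
The "different selection probabilities" part boils down to a weighted random draw. Here is one way it might be sketched in Scala; the ChaosAttack case class, the weights, and the picker are illustrative assumptions rather than ChaosCat's actual code.

```scala
import scala.util.Random

// Hypothetical attack catalog entry: an attack name plus a relative selection
// weight. The real catalog and weights are internal to PagerDuty.
final case class ChaosAttack(name: String, weight: Double)

object AttackPicker {
  private val rng = new Random()

  /** Weighted draw: attacks with larger weights are chosen proportionally more often. */
  def pickAttack(attacks: Seq[ChaosAttack]): ChaosAttack = {
    require(attacks.nonEmpty, "need at least one attack")
    val roll = rng.nextDouble() * attacks.map(_.weight).sum
    // Pair each attack with its cumulative weight and take the first bucket the roll lands in.
    val cumulative = attacks.zip(attacks.scanLeft(0.0)(_ + _.weight).tail)
    cumulative.collectFirst { case (a, cum) if roll < cum => a }.getOrElse(attacks.last)
  }

  /** No exemptions: every host is drawn with equal probability. */
  def pickHost(hosts: Seq[String]): String = hosts(rng.nextInt(hosts.size))
}
```

For example, a catalog like Seq(ChaosAttack("high-latency", 3.0), ChaosAttack("host-restart", 1.0)) would select the latency attack roughly three times as often as the restart.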

Fourth, it waits 10 minutes and then runs steps two and three again, over and over, for the scheduled subset of business hours. If issues arise, anyone can stop the attacks at any time by sending @chaoscat stop.
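
Tying the steps together, the outer loop might look roughly like this sketch. The function parameters stand in for the schedule check, the health check, and the Blender dispatch described above; the names are assumptions, not the real implementation.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.duration._

// Hypothetical outer loop: while the chaos window is open and nobody has sent
// "@chaoscat stop", gate on system health, fire one randomized attack, then
// wait ten minutes and go around again.
class ChaosLoop(windowOpen: () => Boolean,
                statusAllClear: () => Boolean,
                runRandomAttack: () => Unit) {

  private val stopped = new AtomicBoolean(false)

  /** Called by the Slack handler when someone sends "@chaoscat stop". */
  def stop(): Unit = stopped.set(true)

  def run(): Unit =
    while (windowOpen() && !stopped.get()) {
      if (statusAllClear()) runRandomAttack()
      Thread.sleep(10.minutes.toMillis)
    }
}
```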

Learnings

Some teams quickly learned that there’s a world of difference between sitting at the ready with all of your dashboards and logs pulled up, and having something go wrong while you’re getting your morning coffee. These teams identified gaps in their run books and on-call rotations and fixed them. Success!

Another interesting finding: once teams got over their initial discomfort, they automated fixes that had previously been applied manually and properly prioritized technical debt items that had lingered in their backlogs because the failures behind them had been so infrequent. This, in turn, gave those teams more confidence in their services' reliability.

Unfortunately, ChaosCat is tightly tied to our internal infrastructure tooling, which means we won't be open-sourcing it for the moment. However, we'd love to get your feedback and questions about it, so ask away in the PagerDuty Community forums or in the comments below!

We hope that more companies start to practice this kind of reliability engineering (or, as some like to say, chaos engineering). It's a fantastic way to verify the robustness and behavior of increasingly complex and diverse infrastructure.
