Failure Fridays: Four Years On

On June 28th, 2017, we marked four years of performing “Failure Fridays” at PagerDuty.  As a quick recap, Failure Fridays are a weekly practice at PagerDuty in which we inject faults into our production environment in a controlled way, without customer impact. They’ve been foundational in verifying our resiliency engineering efforts.

Over the years, our process has evolved, sometimes following Chaos Engineering principles, sometimes not. But the constant of Failure Friday has always been to help us identify and fix problems before they impact our customers.  Here are a few milestones from our journey, and some of the lessons we learned:

Timeline

2013

  • June: The first Failure Friday!

2014

  • February: Our first security-focused Failure Friday.  Instead of testing a single service’s fault resiliency through isolation, reboots, and the like, we tested a variety of edge cases, such as APIs being sent invalid data and firewalls being misconfigured.  This practice is still in use today: certain Failure Fridays are reserved not for injecting faults into a single service, but for testing infrastructure-wide anti-patterns.
  • April:  Less than a year after we started Failure Fridays, we simulated the full failure of one of the seven different Availability Zones our infrastructure operated in at the time.  The first time we did this we went a little overboard with our paranoia and simulated it meticulously over the course of four separate sessions.  Now we usually complete it in around one.

2015

  • January: After 18 months and 33 sessions, we finally automated a lot of the manual commands from the original Failure Friday post into ChatOps-based tooling.  By doing the steps manually at first, we validated and learned from them without having to spend a lot of time up front.  As we grew as a company, it became more and more difficult to ramp up new folks, so we enlisted the help of our company bot.

 


  • January:  Once we’d gotten comfortable with the idea of losing a single Availability Zone, we stepped up our game to taking out an entire Region.  These still usually take us a few sessions to complete, as they always teach us something new.
  • March: We realized that Failure Fridays were a great opportunity to exercise our Incident Response process, so we started using them as a training ground for our newest Incident Commanders before they graduated.
  • May: As we started to scale up the number of services and teams maintaining them, we started keeping more formal documentation on planned faults, checklists for future sessions, outcomes of fault injections, and so on.  “It’s not science unless you write it down.”

 


2016

  • April: Another year on, and another large-scale round of Failure Friday fault testing: we began simulating failover to our Disaster Recovery infrastructure.  During normal operations, we validate our DR tooling with a small percentage of live traffic, but during these scenarios we ramp up that percentage, taking care not to impact our customers.
  • June: We introduced “Reboot Roulette” to our suite of automation, randomly selecting hosts (with weighting for different categories of hosts) to be injected with a fault (rebooting was the first fault of several added, because alliteration of course).

 


 

  • September: At a Hackday, Chaos Cat was introduced, using all of the existing tooling to automate fault injection (at a separate time from our normal Failure Friday window).
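
As a rough illustration of the weighted random selection described in the Reboot Roulette entry above, here is a minimal Python sketch. It is not PagerDuty's actual tooling: the host names, categories, weights, and the pick_target helper are all hypothetical, and the only fault listed is the reboot mentioned in the post.

```python
import random

# Hypothetical inventory: host name -> category. Names are illustrative only.
HOSTS = {
    "web-1": "web",
    "web-2": "web",
    "worker-1": "worker",
    "db-1": "database",
}

# Hypothetical per-category weights: a higher weight makes each host in
# that category proportionally more likely to be selected.
CATEGORY_WEIGHTS = {"web": 5, "worker": 3, "database": 1}

# Rebooting was the first fault added; further faults are left out here.
FAULTS = ["reboot"]


def pick_target(hosts, category_weights, faults, rng=random):
    """Pick one (host, fault) pair, weighting hosts by their category."""
    names = sorted(hosts)  # stable order so a seeded rng is reproducible
    weights = [category_weights[hosts[name]] for name in names]
    host = rng.choices(names, weights=weights, k=1)[0]
    fault = rng.choice(faults)
    return host, fault


if __name__ == "__main__":
    host, fault = pick_target(HOSTS, CATEGORY_WEIGHTS, FAULTS)
    print(f"Would inject '{fault}' on {host}")
```

Passing a seeded random.Random instance as rng makes a session reproducible, which can help when writing down what was injected and why. Setting a category's weight to 0 excludes its hosts entirely, as long as at least one category keeps a positive weight.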

 


2017

  • July: We formed an internal guild of engineers within PagerDuty across multiple teams, all interested in Chaos Engineering.

Stats

Going back through our Failure Friday records, here are a few metrics from June 28th, 2013 to June 28th, 2017:

  • Failure Friday sessions: 121
  • Tickets created to fix issues identified in Failure Friday: over 200
  • Faults injected: 644
  • Fault injections that resulted in a public postmortem: 3
  • Simulated full AZ failures (disable all services in a given AZ): 4
  • Simulated full Region failures (disable all services in a given region): 3
  • Simulated partial Disaster Recovery (send all traffic to another region): 2
  • Distinct services within PagerDuty that have had faults injected: 47

Conclusions

Injecting failure and continuously improving our infrastructure has helped us not only deliver better software, but also build internal trust and empathy. Stress testing our systems and processes helps us understand how to improve our operations, and you can do it too.

 


The post Failure Fridays: Four Years On appeared first on PagerDuty.

