The Top Causes of Downtime

According to a roundup by Gartner, the average cost of downtime for an enterprise is $5,600 per minute. While the data collected was from incredibly large companies, the cost of downtime for even small startups is no laughing matter.

Let’s assume, for the sake of simplicity, that your core product is a web app that relies solely on organic sales, totaling $1 million in revenue a year. This amounts to about $2 in lost revenue per minute. This doesn’t sound like too much in the grand scheme of things, but revenue is only a small part of your downtime costs. We also must consider wasted operating costs.

Employees’ time and productivity are wasted during downtime, too. If, for example, you pay $500,000 a year in employee costs, that’s an additional $1 per minute in wasted payroll. If you’re keeping track, we’re now at $3 in cost per minute.

That’s $180 an hour. $4,320 a day.
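The arithmetic above can be reproduced with a short sketch. It assumes, as the example implicitly does, that revenue and payroll are spread evenly over every minute of a 365-day year; the function names are illustrative and not taken from any calculator mentioned here.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes the product should be up 24/7


def cost_per_minute(annual_amount: float) -> float:
    """Spread an annual dollar figure evenly across every minute of the year."""
    return annual_amount / MINUTES_PER_YEAR


def downtime_cost(annual_revenue: float, annual_employee_cost: float,
                  minutes_down: float) -> float:
    """Estimate lost revenue plus wasted payroll for an outage of a given length."""
    per_minute = cost_per_minute(annual_revenue) + cost_per_minute(annual_employee_cost)
    return per_minute * minutes_down


if __name__ == "__main__":
    # Figures from the example: $1M in annual revenue, $500K in annual employee costs.
    print(f"revenue: ${cost_per_minute(1_000_000):.2f} per minute")
    print(f"payroll: ${cost_per_minute(500_000):.2f} per minute")
    print(f"one hour down: ${downtime_cost(1_000_000, 500_000, 60):.2f}")
```

The unrounded rate comes out a little under $3 per minute (roughly $2.85), so the $180-an-hour figure is a slight overestimate that results from rounding each component up to a whole dollar.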

[Figure: downtime calculator example]

Source: downtimecost.com

Adds up quickly, doesn’t it? Now we’ve accounted for employee costs and lost revenue, but what about other wasted expenses? Every unused piece of your architecture results in additional losses during downtime. Servers and third-party services sit idle while your team works on a fix, and the fix itself may require additional (and costly) resources.

Depending on how critical your product is to your customers’ businesses, downtime could cost you not only money but also your customers’ trust. It’s difficult to justify paying an unreliable vendor, so while one outage is easily survivable, the loss of faith in your product compounds with every subsequent occurrence.

Causes + Solutions

Ultimately, by understanding the causes of outages, you can maximize your chances of preventing them. The causes boil down to a few categories: human error, third-party service outages, and highly unpredictable “black swan” occurrences.

Human Error

One of the most common causes of downtime that I’ve personally seen is human error. Whether a developer commits broken code or an administrator updates an untested package, product uptime suffers when procedure isn’t followed or an obscure system bug isn’t accounted for. Establishing a system of checks and balances within an organization is the best solution to this problem. Code reviews, unit tests, quality assurance, proper planning, and clear communication all go a long way toward preventing avoidable downtime.

Service Outages

Sometimes, however, downtime isn’t caused internally. From time to time, even cloud providers like Amazon Web Services (AWS) go down. There is very little an organization can do when this happens, at least not without a proper plan in place. To combat this, I’m a fan of Netflix’s Chaos Monkey system. For the uninitiated, Chaos Monkey is a system whose sole job is to kill off random services within a product’s architecture. This forces the system to be self-repairing, and trains the team to handle outages effectively when they really matter. PagerDuty conducts its own Failure Fridays as well!
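To make the idea concrete, here is a toy sketch of random termination followed by self-repair. It is not Netflix’s Chaos Monkey or its API; `ServicePool` and `unleash_monkey` are hypothetical names, and the "instances" are just strings in memory rather than real servers.

```python
import random


class ServicePool:
    """Toy in-memory stand-in for a group of identical service instances."""

    def __init__(self, name, desired_count):
        self.name = name
        self.desired_count = desired_count
        self._next_id = 0
        self.instances = []
        self.heal()

    def heal(self):
        """Start replacements until the pool is back at its desired size."""
        while len(self.instances) < self.desired_count:
            self.instances.append(f"{self.name}-{self._next_id}")
            self._next_id += 1

    def terminate(self, instance):
        """Simulate an instance dying."""
        self.instances.remove(instance)


def unleash_monkey(pools, rng):
    """Terminate one randomly chosen instance and return (pool, instance)."""
    pool = rng.choice(pools)
    victim = rng.choice(pool.instances)
    pool.terminate(victim)
    return pool, victim


if __name__ == "__main__":
    pools = [ServicePool("web", 3), ServicePool("api", 2)]
    pool, victim = unleash_monkey(pools, random.Random())
    print(f"terminated {victim}; {pool.name} now has {len(pool.instances)} instances")
    pool.heal()  # a self-repairing system would do this without human help
    print(f"after repair: {pool.instances}")
```

In a real environment the repair step would be handled by whatever orchestration or auto-scaling mechanism the team relies on. The point of the exercise is to confirm that it actually works before a genuine outage does the testing for you.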

Alerting

While occasional downtime is unavoidable (even Facebook goes down from time to time), how you prepare for and handle it will determine how much of an impact it has on your organization. Because every minute of downtime means additional costs, establishing workflows to prevent outages or reduce their length is crucial. Solutions like PagerDuty accelerate real-time incident resolution by notifying everyone, getting them on the same page as soon as possible, and providing a platform for surfacing the context needed to fix the issue. With all your event data aggregated and communication optimized, it becomes far easier to identify the root cause of an outage and resolve issues efficiently and accurately.

Communication

It’s important to remember that improving external communication matters just as much as improving internal communication. Communicating information about an outage to your customers early and clearly goes a long way toward maintaining trust and credibility with them. Through the use of tools like StatusPage and StatusCast, as well as PagerDuty’s Stakeholder Engagement, organizations can better orchestrate the real-time business and external-facing response, and use status pages to provide valuable transparency into the health of a product. Personally, nothing erodes my trust more than an organization that remains quiet through a crisis. Its silence feels like an attempt to hide something.

On-Call Rotations

All of these solutions are great, but an indispensable part of managing unexpected downtime is making sure there are always people on hand to fix the issue. This can be accomplished by establishing an on-call rotation among your engineers. An effective on-call rotation is a minimal investment that can increase product reliability, maintain accountability, improve service delivery, and give your team a better work-life balance. Without an on-call rotation, every outage turns into an “all hands” event, which is disruptive to the personal lives of every employee. On the flip side, a clearly defined on-call schedule and escalation policies mean that workloads are balanced, and there is always a dedicated subject matter expert ready to fix an issue or drive collaboration toward a resolution as needed.
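As a rough illustration, a weekly round-robin rotation with a simple escalation path can be generated in a few lines. This is a minimal sketch under stated assumptions: shifts last one week, next week’s primary doubles as this week’s secondary (the escalation contact), and the `build_rotation` helper and engineer names are made up for the example rather than drawn from any scheduling product.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return one (shift_start, primary, secondary) tuple per week.

    The secondary is the engineer who will be primary the following week,
    so escalations go to someone who is about to take over anyway.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary and a secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


if __name__ == "__main__":
    for shift_start, primary, secondary in build_rotation(
        ["ana", "ben", "chi"], date(2024, 1, 1), weeks=4
    ):
        print(f"{shift_start}: primary={primary}, escalate to={secondary}")
```

A real schedule would also need to handle time zones, holidays, and shift swaps, which is where a dedicated scheduling tool earns its keep.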

In the end, the best way to plan for (and mitigate) downtime is to invest in your resources and your team. Not every solution mentioned here is right for every organization, but the cost of doing nothing is significantly higher than the cost of doing something. When you have an established process for handling outages, it won’t matter whether an outage was caused by a hacker or a power failure. You and your team will be prepared to handle it.



The post The Top Causes of Downtime appeared first on PagerDuty.


More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and GitHub.
