The Top Causes of Downtime

According to a roundup by Gartner, the average cost of downtime for an enterprise is $5,600 per minute. While the data collected was from incredibly large companies, the cost of downtime for even small startups is no laughing matter.

Let’s assume, for the sake of simplicity, that your core product is a web app that relies solely on organic sales, totaling $1 million in revenue a year. Spread evenly across the year, that amounts to about $2 in lost revenue per minute of downtime. This doesn’t sound like much in the grand scheme of things, but revenue is only a small part of your downtime costs. We must also consider wasted operating costs.

Employees’ time and productivity, too, are wasted during downtime. If, for example, you pay $500,000 a year in employee costs, that’s roughly an additional $1 in wasted cost per minute. If you’re keeping track, we’re now at $3 in cost per minute.

That’s $180 an hour. $4,320 a day.
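The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a full cost model: it assumes revenue and payroll are spread evenly over every minute of a 365-day year, and the function names `per_minute` and `downtime_cost` are made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def per_minute(annual_amount: float) -> float:
    """Spread an annual dollar figure evenly over every minute of the year."""
    return annual_amount / MINUTES_PER_YEAR


def downtime_cost(annual_revenue: float, annual_employee_cost: float,
                  minutes_down: float) -> float:
    """Estimate lost revenue plus wasted payroll for an outage of a given length."""
    rate = per_minute(annual_revenue) + per_minute(annual_employee_cost)
    return rate * minutes_down


if __name__ == "__main__":
    revenue, payroll = 1_000_000, 500_000  # the figures used in this post
    print(f"per minute: ${downtime_cost(revenue, payroll, 1):.2f}")
    print(f"per hour:   ${downtime_cost(revenue, payroll, 60):.2f}")
    print(f"per day:    ${downtime_cost(revenue, payroll, 24 * 60):.2f}")
```

Without rounding, the combined rate comes to about $2.85 per minute, or roughly $171 an hour and $4,110 a day, slightly below the rounded $3, $180, and $4,320 used in the text.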

[Figure: downtime calculator example]

Source: downtimecost.com

Adds up quickly, doesn’t it? Now we’ve accounted for employee costs and lost revenue, but what about other wasted expenses? Every unused piece of your architecture results in additional losses during downtime. Servers and third-party services you are still paying for sit idle while your team works on a fix, and the fix itself could require additional (and costly) resources.

Depending on how critical your product is to your customers’ businesses, downtime could cost you not only money but also your customers’ trust. It’s difficult to justify paying an unreliable vendor, so while one outage is easily survivable, the loss of faith in your product compounds with every subsequent occurrence.

Causes + Solutions

Ultimately, by understanding the causes of outages, you can maximize your chances of preventing them. The causes can be boiled down to a few categories: human error, third-party service outages, and highly unpredictable “black swan” occurrences.

Human Error

One of the most common causes of downtime that I’ve personally seen is human error. Whether a developer commits broken code or an administrator updates an untested package, when procedure isn’t followed or an obscure system bug isn’t accounted for, product uptime will suffer. Establishing a system of checks and balances within an organization is the best solution to this problem. Code reviews, unit tests, quality assurance, proper planning, and clear communication all go a long way toward preventing avoidable downtime.

Service Outages

Sometimes, however, downtime isn’t caused internally. From time to time, even cloud providers like Amazon Web Services (AWS) go down. There is very little an organization can do when this happens unless it has a proper plan in place. To combat this, I’m a fan of Netflix’s Chaos Monkey system. For the uninitiated, Chaos Monkey is a system whose sole job is to kill off random services within a product’s architecture. This forces the system to be self-repairing, and trains the team to handle outages effectively when they really matter. PagerDuty conducts its own Failure Fridays as well!
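To make the idea concrete, here is a toy, in-memory sketch of a chaos experiment in Python. It is not Chaos Monkey and does not use its API; the `Service` class and both functions are hypothetical stand-ins that only show the kill-then-recover loop such a tool exercises.

```python
import random


class Service:
    """Stand-in for one component of the architecture (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True


def kill_random_service(services: list[Service], rng: random.Random) -> Service:
    """Chaos step: mark one randomly chosen service as down."""
    victim = rng.choice(services)
    victim.healthy = False
    return victim


def self_heal(services: list[Service]) -> list[str]:
    """Recovery step: restart every unhealthy service and report which ones."""
    restarted = []
    for service in services:
        if not service.healthy:
            service.healthy = True  # a real system would restart or replace it
            restarted.append(service.name)
    return restarted


if __name__ == "__main__":
    fleet = [Service(n) for n in ("web", "api", "queue", "db-replica")]
    victim = kill_random_service(fleet, random.Random())
    print(f"killed {victim.name}; restarted {self_heal(fleet)}")
```

In a real experiment the recovery step would be your orchestration or failover tooling, and the interesting result is how long it takes and whether anyone had to intervene.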

Alerting

While occasional downtime is completely unavoidable (even Facebook goes down from time to time), how you handle and prepare for it will determine just how much of an impact it has on your organization. Because every minute of downtime means additional costs, establishing workflows to prevent an outage or reduce its length is crucial. Solutions like PagerDuty accelerate real-time incident resolution by notifying the right people, getting everyone on the same page as soon as possible, and providing a platform for surfacing the context needed to fix the issue. By aggregating all your event data and optimizing communication, it becomes far easier to identify the root cause of an outage and resolve issues efficiently and accurately.

Communication

It’s important to remember that improving communication externally is just as important as improving it internally. Communicating information about an outage to your customers early and clearly goes a long way toward maintaining trust and credibility with them. Through the use of tools like StatusPage and StatusCast, as well as PagerDuty’s Stakeholder Engagement, organizations can better orchestrate the real-time business and external-facing response, and use status pages to provide valuable transparency into the health of a product. Personally, I find nothing less trustworthy than an organization that remains quiet through a crisis. Its silence feels like an attempt at hiding something.

On-Call Rotations

All of these solutions are great, but an indispensable part of managing unexpected downtime is making sure there are always people on hand to fix the issue. This can be easily accomplished by establishing an on-call rotation among your engineers. An effective on-call rotation is a minimal investment that can increase product reliability, maintain accountability, improve service delivery, and improve work-life balance for your team. Without an on-call rotation, every outage turns into an “all hands” event, which is disruptive to the personal lives of every employee. On the flip side, a clearly defined on-call schedule and escalation policies mean that workloads are balanced, and there is always a dedicated subject matter expert who is ready to fix an issue or drive collaboration for resolution as needed.
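As a rough illustration of a balanced schedule, the Python sketch below builds a weekly round-robin rotation with a secondary responder as the escalation target. The weekly shift length, the engineer names, and the function names are assumptions made for this example; it does not represent how PagerDuty or any other scheduling product works.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return one (shift_start, primary, secondary) tuple per week, round-robin."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # The next engineer in line is the escalation target for this shift.
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


def on_call_for(schedule, day):
    """Look up who is primary and secondary on a given day."""
    for shift_start, primary, secondary in schedule:
        if shift_start <= day < shift_start + timedelta(weeks=1):
            return primary, secondary
    raise LookupError(f"no shift covers {day}")


if __name__ == "__main__":
    team = ["ana", "ben", "chloe"]  # hypothetical engineers
    rotation = build_rotation(team, start=date(2017, 3, 6), weeks=6)
    for shift_start, primary, secondary in rotation:
        print(shift_start, "primary:", primary, "escalate to:", secondary)
```

Because each engineer takes the primary shift once every `len(engineers)` weeks, the load stays even, and every shift has a named person to escalate to.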

In the end, the best way to plan for (and mitigate) downtime is to invest in your resources and your team. Not every solution mentioned here is right for every organization, but the cost of doing nothing is significantly higher than the cost of doing something. When you have an established process for handling outages, it won’t matter whether an outage was caused by a hacker or a power failure. You and your team will be prepared to handle it.


The post The Top Causes of Downtime appeared first on PagerDuty.

More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and Github.
