An In-depth Analysis of the AWS S3 Outage Impact

Amazon Web Services (AWS) S3, Amazon’s web-based storage service, experienced widespread issues in North America beginning at 12:37 PM EST on February 28. Amazon’s status dashboard reported “high error rates with S3 in US-EAST-1.” This was the only explanation provided at the time.

Consequently, many popular online services that use S3, such as Quora, Imgur, and Trello, suffered outages throughout the day. Amazon’s own status dashboard was affected too: its status icons are hosted on S3 and could not be updated until 2:35 PM EST.

[Image: AWS error message]

S3 was completely unavailable beginning around 12:37 PM EST and began improving around 3:45 PM EST, as seen in the chart below.

[Image: AWS S3 outage]

[Image: Twitter AWS]

[Image: Trello error message]

[Image: Quora error message]

Ironically, Isitdownrightnow.com, a website that reports whether another site is currently unavailable, was also down during this time.

[Image: IsItDownRightNow.com error message]

AWS continued to provide updates for affected services throughout the day on their status page.

[Image: AWS status dashboard]

[Image: AWS status dashboard]

[Image: AWS error messages]

While Quora was unavailable, their website www.quora.com was returning a “504 Gateway Timed Out” error. Using our synthetic monitoring tool, we could see the failures occurring in real time.

[Image: Quora errors]

You can also see the 504 being returned for Quora’s homepage in the Waterfall chart below.

[Image: Quora performance chart]

[Image: Quora performance errors]

Mashable.com was among the many other sites that faced significant issues, such as images failing to load because they were hosted in S3 buckets. Below is an instant test of Mashable.com, where we saw multiple images that could not be served.

[Image: Quora errors]

The traceroute below ran from Catchpoint to one of the S3 buckets. As you can see, timeouts occurred closer to the destination.

[Image: traceroute to an S3 bucket]

We can group the downtime into two buckets:

  1. From 12:35 PM EST to 3:35 PM EST – Connection failures

We could not establish a TCP connection to the S3 endpoints from anywhere in the world (it was not a geo or network transit issue).

  2. From 3:35 PM EST to 4:16 PM EST – High wait times and 500 errors
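The two phases above correspond to different failure signatures, and a monitoring check can tell them apart. The following is a minimal Python sketch under stated assumptions, not a description of Catchpoint's tooling: the `ProbeResult` structure, the `classify` function, and the 5-second wait threshold are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Outcome of one synthetic check (hypothetical structure)."""
    connected: bool               # did the TCP connection succeed?
    status: Optional[int] = None  # HTTP status code, if a response arrived
    wait_seconds: float = 0.0     # time spent waiting for the first byte


def classify(result: ProbeResult, slow_after: float = 5.0) -> str:
    """Map a probe result to a failure bucket.

    slow_after is an assumed threshold, not a value from the article.
    """
    if not result.connected:
        return "connection failure"
    if result.status is not None and 500 <= result.status <= 599:
        return "server error"
    if result.wait_seconds > slow_after:
        return "high wait time"
    return "ok"
```

Checking the connection first mirrors the timeline: while no TCP connection could be established there was no HTTP status to inspect, so status codes and wait times only become meaningful once the endpoint accepts connections again.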

At the end of a hectic day, we were left with a cold, hard truth – 100% uptime is unrealistic. Precautions must be taken for when situations like this occur, no matter how robust the system. Monitoring your own services, along with third-party services, enables you to catch performance issues and resolve them in a timely manner to ensure your user base’s confidence in your service. Communication with your users is also crucial when catastrophe strikes. Amazon took the proper steps in communication by being upfront and transparent about the issue across multiple platforms, allowing some reprieve for their users during a time of utter chaos.

We should also remember that it wasn’t Amazon’s fault that these major websites and services were completely out of service during this time. The cloud is still just a bunch of servers, switches, and someone’s code. This means it’s still vulnerable to failures, outages, and performance issues, and this isn’t the first time AWS has failed. It’s not Amazon’s responsibility to create a redundancy plan for its customers; it’s the customer’s job to make sure that their business is covered when the services they use fail. Many of the companies that went down yesterday offer products and services that other companies rely on every day to do their jobs.

Having a failsafe contingency plan, like distributing to multiple cloud services and zones, is what determines how much damage an outage like this does to a business.
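One simple form of such a contingency plan is client-side failover: try a primary endpoint, then fall back to a copy hosted elsewhere. The sketch below is an illustration only and assumes the same content is already replicated at each URL. The function names are hypothetical, and a real deployment would also need replication, health checks, and cache handling that are not shown.

```python
from typing import Callable, Iterable
from urllib.request import urlopen


def http_get(url: str, timeout: float = 3.0) -> bytes:
    """Default fetcher: plain HTTP GET with a timeout."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_failover(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = http_get,
) -> bytes:
    """Return the first successful response, trying each URL in order."""
    errors = []
    for url in urls:
        try:
            return fetch(url)
        except OSError as exc:  # URLError, timeouts, and refused connections
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))
```

The `fetch` parameter is injectable so the failover logic can be exercised without network access, for example by passing a function that raises `ConnectionRefusedError` for the primary URL.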

Many people are quick to assume that those affected by this outage were single-entity websites; however, the magnitude is much larger than that. The scope of impact ranges from websites to IoT (Internet of Things), because many companies rely on such cloud services for every aspect of their digital experience. In fact, we found ourselves deeply affected by this outage in several different ways: video conferencing systems Zoom and Bluejeans were down, our office door management system Kisi was inaccessible, and Duo Security and several other tools and systems we use on a daily basis were completely unavailable.

This is the second time in the last several months that our daily operations were severely affected by a major cloud service’s outage. We are now going to be grilling our vendors and asking them:

  • What is your DNS redundancy? A single vendor is not acceptable.
  • Are you on the public cloud? If so, what is your redundancy plan? Are you multi-cloud?
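The DNS question can be given a rough first-pass check. The sketch below assumes you have already looked up a vendor's NS records by some other means (the lookup itself is not shown) and groups the nameserver hostnames by their last two labels. That grouping is a naive heuristic that misjudges suffixes such as `co.uk`, and the function names are hypothetical.

```python
def dns_providers(nameservers):
    """Group NS hostnames by their last two labels (a naive heuristic)."""
    providers = set()
    for host in nameservers:
        labels = host.rstrip(".").lower().split(".")
        providers.add(".".join(labels[-2:]))
    return providers


def has_dns_redundancy(nameservers):
    """True when the NS records point at more than one provider domain."""
    return len(dns_providers(nameservers)) > 1
```

A vendor whose nameservers all share one provider domain would fail this check, which is the single-vendor situation the first question is meant to expose.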

We do not use any public cloud service (AWS, Google, Azure), not because we do not want to, but because many of our customers forbid us from using them. Now we understand why!

The most important takeaway from this incident is that we all have a duty to our customers to provide the best service possible, under any circumstance. The tools we use will only take us so far—it’s up to us to make sure our critical components are covered by redundancy.

By: Mehdi Daoudi, Nilabh Mishra, Mitchell Zelmanovich, David Lui

The post An In-depth Analysis of the AWS S3 Outage Impact appeared first on Catchpoint's Blog.
