An In-depth Analysis of the AWS S3 Outage Impact

The Amazon Web Services (AWS) S3 web-based storage service in North America experienced widespread issues beginning at 12:37 PM EST on February 28. Amazon’s status dashboard reported “high error rates with S3 in US-EAST-1,” and this was the only explanation provided at the time.

Consequently, many popular online services that use S3, such as Quora, Imgur, and Trello, suffered outages throughout the day. So did Amazon’s very own status dashboard: its status icons are hosted on S3 and could not be updated until 2:35 PM EST.

[Image: AWS error message]

S3 was completely unavailable beginning around 12:37 PM EST and began improving around 3:45 PM EST, as seen in the chart below.

[Image: AWS S3 outage]

[Image: Twitter AWS]

[Image: Trello error message]

[Image: Quora error message]

Ironically, IsItDownRightNow.com, a website that reports whether another site is currently unavailable, was also down during this time.

[Image: IsItDownRightNow.com error message]

AWS continued to provide updates for affected services throughout the day on their status page.

[Image: AWS status dashboard]

[Image: AWS status dashboard]

[Image: AWS error messages]

While Quora was unavailable, their website www.quora.com was returning a “504 Gateway Timed Out” error. Using our synthetic monitoring tool, we could see the failures occurring in real time.
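As a rough illustration of what a single synthetic check does, here is a minimal Python sketch using only the standard library. It is not how our monitoring tool is implemented; the names `probe` and `is_failure` and the 10-second timeout are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0):
    """Fetch url once; return (status or None, elapsed seconds, error text or None)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, time.monotonic() - start, None
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses (a 504, for example) are raised as HTTPError
        # and still carry the status code the server sent.
        return exc.code, time.monotonic() - start, None
    except (urllib.error.URLError, OSError) as exc:
        # No HTTP response at all: DNS failure, refused connection, timeout.
        return None, time.monotonic() - start, str(exc)


def is_failure(status):
    """Treat a missing response or any 5xx status as a failed check."""
    return status is None or status >= 500
```

Running `probe` on a schedule from several locations and alerting when `is_failure` holds for consecutive runs is the basic idea; a real monitoring product adds much more on top of this.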

[Image: Quora errors]

You can also see the 504 being returned for Quora’s homepage in the Waterfall chart below.

[Image: Quora performance chart]

[Image: Quora performance errors]

Mashable.com was among the many other sites that faced significant issues, such as images failing to load because they were hosted in S3 buckets. Below is an instant test of Mashable.com, in which multiple images were not served.

[Image: instant test of Mashable.com]

The traceroute below ran from Catchpoint to one of the S3 buckets. As you can see, timeouts occurred closer to the destination.

[Image: traceroute from Catchpoint to an S3 bucket]

We can group the downtime into two buckets:

  1. From 12:35 PM to 3:35 PM EST – Connection failures

We could not establish a TCP connection to the S3 endpoints from anywhere in the world (it was not a geo or network transit issue).

  2. From 3:35 PM to 4:16 PM EST – High wait times and 500 errors
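The two phases above correspond to different failure signatures, and a monitoring script can tell them apart. The following Python sketch is an assumption-laden illustration: the `Sample` fields, the `classify` labels, and the 2000 ms threshold are all invented for this example and are not taken from our measurements.

```python
import socket
from dataclasses import dataclass
from typing import Optional


def tcp_connects(host, port=443, timeout=5.0):
    """Return True if a TCP handshake with host:port completes in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out: no connection was established.
        return False


@dataclass
class Sample:
    connected: bool            # did the TCP handshake complete?
    status: Optional[int]      # HTTP status code, or None if no response
    wait_ms: Optional[float]   # time to first byte in ms, or None


def classify(sample, slow_ms=2000.0):
    """Map one measurement to a coarse failure bucket (hypothetical labels)."""
    if not sample.connected:
        return "connection_failure"   # phase 1 signature
    if sample.status is not None and sample.status >= 500:
        return "server_error"         # phase 2 signature: 500 errors
    if sample.wait_ms is not None and sample.wait_ms > slow_ms:
        return "high_wait"            # phase 2 signature: slow responses
    return "ok"
```

Counting the labels per time window from many locations would give the same kind of split: connection failures first, then slow responses and server errors as the service recovers.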

At the end of a hectic day, we were left with a cold, hard truth – 100% uptime is unrealistic. Precautions must be taken for when situations like this occur, no matter how robust the system. Monitoring your own services, along with third-party services, enables you to catch performance issues and resolve them in a timely manner to ensure your user base’s confidence in your service. Communication with your users is also crucial when catastrophe strikes. Amazon took the proper steps in communication by being upfront and transparent about the issue across multiple platforms, allowing some reprieve for their users during a time of utter chaos.

We should also remember that it wasn’t Amazon’s fault that these major websites, services, and more were completely out of service during this time. The cloud is still just a bunch of servers, switches, and someone’s code. This means it’s still vulnerable to failures, outages, and performance issues, and this isn’t the first time AWS has failed. It’s not Amazon’s responsibility to create a redundancy plan for its customers; it’s the customer’s job to make sure that their business is covered when the services they use fail. Many of the companies that went down yesterday offer products and services that other companies rely on every day to do their jobs.

Having a failsafe contingency plan, such as distributing across multiple cloud services and zones, is what determines how much damage an outage like this does to a business.
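One simple form of such a plan is to keep copies of critical assets in more than one place and have the client try each location in turn. The Python sketch below illustrates the idea under stated assumptions: the function names are invented for this example, the URLs in the usage note are placeholders, and it presumes the same object has already been replicated to every endpoint listed.

```python
import urllib.request


def fetch_url(url, timeout=5.0):
    """Default fetcher: return the response body, raising on errors and 4xx/5xx."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_with_failover(urls, fetch=fetch_url):
    """Try each replica URL in order; return (url_used, body) for the first success."""
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all endpoints failed: " + "; ".join(errors))
```

A caller might pass something like `["https://primary.example/logo.png", "https://secondary.example/logo.png"]`, where the second URL points at a different region or provider. Client-side failover is only one option; DNS-based or CDN-based failover achieves the same goal at a different layer.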

Many people are quick to assume that those affected by this outage were single-entity websites; however, the magnitude is much larger than that. The scope of impact ranges anywhere from websites to IoT (Internet of Things), since many companies rely on such cloud services for every aspect of their digital experience. In fact, we found ourselves deeply affected by this outage in several different ways: video conferencing systems Zoom and Bluejeans were down, our office door management system Kisi was inaccessible, and Duo Security and several other tools and systems we use on a daily basis were completely unavailable.

This is the second time in the last several months that our daily operations were severely affected by a major cloud service’s outage. We are now going to be grilling our vendors and asking them:

  • What is your DNS redundancy? A single vendor is not acceptable.
  • Are you on the public cloud? If so, what is your redundancy plan? Are you multi-cloud?
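The DNS question can be partly checked from the outside by looking at who operates a vendor's authoritative nameservers. The Python sketch below assumes you already have the list of nameserver hostnames (for example from an NS lookup) and groups them by their last two labels. That grouping is a crude stand-in for "provider" and misjudges suffixes such as co.uk; the function names and the hostnames in the usage note are hypothetical.

```python
def dns_providers(nameservers):
    """Group nameserver hostnames by their last two labels (a rough proxy for vendor)."""
    providers = set()
    for ns in nameservers:
        labels = ns.rstrip(".").lower().split(".")
        providers.add(".".join(labels[-2:]))
    return providers


def has_dns_redundancy(nameservers):
    """True when the nameservers appear to belong to at least two vendors."""
    return len(dns_providers(nameservers)) >= 2
```

For example, `has_dns_redundancy(["ns1.vendor-a.example.", "ns2.vendor-a.example."])` is false because both hosts share one parent domain, while adding `"ns1.vendor-b.example."` makes it true.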

We do not use any public cloud service (AWS, Google, Azure); not because we do not want to, but because many of our customers forbid us from using them and now we understand why!

The most important takeaway from this incident is that we all have a duty to our customers to provide the best service possible, under any circumstance. The tools we use will only take us so far—it’s up to us to make sure our critical components are covered by redundancy.

By: Mehdi Daoudi, Nilabh Mishra, Mitchell Zelmanovich, David Lui

The post An In-depth Analysis of the AWS S3 Outage Impact appeared first on Catchpoint's Blog.

Catchpoint radically transforms the way businesses manage, monitor, and test the performance of online applications. Truly understand and improve user experience with clear visibility into complex, distributed online systems.

Founded in 2008 by four DoubleClick / Google executives with a passion for speed, reliability and overall better online experiences, Catchpoint has now become the most innovative provider of web performance testing and monitoring solutions. We are a team with expertise in designing, building, operating, scaling and monitoring highly transactional Internet services used by thousands of companies and impacting the experience of millions of users. Catchpoint is funded by top-tier venture capital firm, Battery Ventures, which has invested in category leaders such as Akamai, Omniture (Adobe Systems), Optimizely, Tealium, BazaarVoice, Marketo and many more.
