
An In-depth Analysis of the AWS S3 Outage Impact

The Amazon Web Services (AWS) S3 storage service in North America experienced widespread issues beginning at 12:37 PM EST on February 28. Amazon’s status dashboard reported only “high error rates with S3 in US-EAST-1”; no further explanation was provided at the time.

Consequently, many popular online services that use S3, such as Quora, Imgur, and Trello, suffered outages throughout the day. So did Amazon’s own status dashboard: its status icons are hosted on S3 and could not be updated until 2:35 PM EST.

[Image: AWS error message]

S3 was completely unavailable beginning around 12:37 PM EST and began improving around 3:45 PM EST, as seen in the chart below.

[Image: AWS S3 outage]

[Image: Twitter AWS]

[Image: Trello error message]

[Image: Quora error message]

Ironically, Isitdownrightnow.com, a website that reports whether another site is currently unavailable, was also down during this time.

[Image: IsItDownRightNow.com error message]

AWS continued to provide updates for affected services throughout the day on their status page.

[Image: AWS status dashboard]

[Image: AWS status dashboard]

[Image: AWS error messages]

While Quora was unavailable, their website www.quora.com was returning a “504 Gateway Timed Out” error. Using our synthetic monitoring tool, we could see the failures occurring in real time.

[Image: Quora errors]

You can also see the 504 being returned for Quora’s homepage in the Waterfall chart below.

[Image: Quora performance chart]

[Image: Quora performance errors]
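A check like this can be approximated in a few lines. The sketch below is not Catchpoint’s tooling; it is a minimal illustration, assuming a single HTTP GET is enough to tell a healthy page from a gateway error. The function name `probe`, the injectable `opener` parameter, and the result labels are all hypothetical.

```python
import urllib.error
import urllib.request


def probe(url, opener=urllib.request.urlopen, timeout=10.0):
    """Fetch url once and return (label, status_code_or_None).

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return ("ok", response.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 504.
        label = "server_error" if err.code >= 500 else "client_error"
        return (label, err.code)
    except (urllib.error.URLError, OSError):
        # No HTTP response at all: DNS failure, refused connection, timeout.
        return ("unreachable", None)
```

Run on a schedule from several locations, a probe of this kind would separate a 504 (the site answered with an error) from a site that never answered at all.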

Mashable.com was among the many other sites that faced significant issues. The instant test of Mashable.com below shows multiple images failing to load because they were hosted in S3 buckets.

[Image: instant test of Mashable.com]

The traceroute below ran from Catchpoint to one of the S3 buckets. As you can see, timeouts occurred closer to the destination.

[Image: traceroute from Catchpoint to an S3 bucket]

We can group the downtime into two buckets:


  1. From 12:35 PM to 3:35 PM EST: Connection failures

We could not establish a TCP connection to the S3 endpoints from anywhere in the world (it was not a geographic or network transit issue).

  2. From 3:35 PM to 4:16 PM EST: High wait times and 500 errors

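The two phases above differ in where a request fails: before the TCP handshake completes, or after it, with a slow or 5xx response. A rough way to sort individual probe results along those lines is sketched below. The `Sample` fields, the labels, and the 5000 ms threshold are assumptions made for illustration, not values taken from our measurements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """One probe result. Field names are hypothetical."""
    connected: bool            # did the TCP handshake complete?
    wait_ms: Optional[float]   # time to first byte, if connected
    status: Optional[int]      # HTTP status code, if a response arrived


def classify(sample, slow_ms=5000.0):
    """Assign a sample to a failure category.

    `slow_ms` is an arbitrary illustrative threshold, not a measured value.
    """
    if not sample.connected:
        return "connection_failure"
    if sample.status is not None and sample.status >= 500:
        return "server_error"
    if sample.wait_ms is not None and sample.wait_ms > slow_ms:
        return "high_wait"
    return "ok"
```

Counting these labels per time window would show the shift from connection failures to slow and 5xx responses as the service recovers.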

At the end of a hectic day, we were left with a cold, hard truth – 100% uptime is unrealistic. Precautions must be taken for when situations like this occur, no matter how robust the system. Monitoring your own services, along with third-party services, enables you to catch performance issues and resolve them in a timely manner to ensure your user base’s confidence in your service. Communication with your users is also crucial when catastrophe strikes. Amazon took the proper steps in communication by being upfront and transparent about the issue across multiple platforms, allowing some reprieve for their users during a time of utter chaos.

We should also remember that it wasn’t Amazon’s fault that these major websites and services were completely out of service during this time. The cloud is still just a bunch of servers, switches, and someone’s code. This means it’s still vulnerable to failures, outages, and performance issues, and this isn’t the first time AWS has failed. It’s not Amazon’s responsibility to create a redundancy plan for its customers; it’s the customer’s job to make sure that their business is covered when the services they use fail. Many of the companies that went down yesterday offer products and services that other companies rely on every day to do their jobs.


Having a failsafe contingency plan, such as distributing across multiple cloud services and zones, is what determines how much damage an outage like this does to a business.
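At the application level, one simple form of such a plan is to keep copies of critical assets in more than one place and fall back when the primary copy is unreachable. The sketch below illustrates the idea only; `fetch_with_failover`, its parameters, and the endpoint URLs in the usage note are hypothetical, and it assumes the data has already been replicated to every endpoint.

```python
def fetch_with_failover(key, endpoints, fetch):
    """Try each endpoint in order; return (endpoint, data) from the first that works.

    `endpoints` is an ordered list of base URLs (for example, copies of the
    same content in different regions or with different providers). `fetch`
    is any callable taking (endpoint, key) that returns bytes or raises OSError.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, fetch(endpoint, key)
        except OSError as err:
            # Record the failure and fall through to the next replica.
            errors.append((endpoint, err))
    raise ConnectionError(f"all {len(errors)} endpoints failed for {key!r}")
```

For instance, `fetch_with_failover("logo.png", ["https://primary.example", "https://secondary.example"], my_fetch)` would serve the asset from the secondary copy whenever `my_fetch` raises for the primary. The fallback is only as good as the replication behind it, which is where most of the real work lies.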

Many people are quick to assume that those affected by this outage were single-entity websites, but the magnitude is much larger than that. The impact ranges from websites to the Internet of Things (IoT), because many companies rely on such cloud services for every aspect of their digital experience. In fact, we were deeply affected by this outage in several ways: the video conferencing systems Zoom and BlueJeans were down, our office door management system Kisi was inaccessible, and Duo Security and several other tools and systems we use daily were completely unavailable.

This is the second time in the last several months that our daily operations were severely affected by a major cloud service’s outage. We are now going to be grilling our vendors and asking them:

  • What is your DNS redundancy? A single vendor is not acceptable.
  • Are you on the public cloud? If so, what is your redundancy plan? Are you multi-cloud?
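The DNS question can be given a rough first pass before talking to the vendor. The sketch below takes a list of NS hostnames (gathered however you like) and checks whether they appear to belong to more than one provider. It assumes the last two labels of a nameserver’s hostname identify the vendor, which is a naive heuristic: it misgroups names under multi-label suffixes such as co.uk, and vanity nameservers can hide the real provider. The function names and the sample hostnames in the usage note are hypothetical.

```python
def dns_vendors(nameservers):
    """Group NS hostnames by a rough 'vendor' key.

    Assumption: the last two labels of the hostname identify the provider.
    """
    vendors = set()
    for name in nameservers:
        labels = name.rstrip(".").lower().split(".")
        vendors.add(".".join(labels[-2:]))
    return vendors


def has_dns_redundancy(nameservers):
    """True when the NS set spans more than one apparent vendor."""
    return len(dns_vendors(nameservers)) > 1
```

With this heuristic, `["ns1.dns-a.example.", "ns2.dns-a.example."]` counts as a single vendor, while adding `"ns1.dns-b.example."` makes the set redundant. A passing result is a starting point for the conversation, not proof of independence.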

We do not use any public cloud service (AWS, Google, Azure); not because we do not want to, but because many of our customers forbid us from using them and now we understand why!

The most important takeaway from this incident is that we all have a duty to our customers to provide the best service possible, under any circumstance. The tools we use will only take us so far—it’s up to us to make sure our critical components are covered by redundancy.

By: Mehdi Daoudi, Nilabh Mishra, Mitchell Zelmanovich, David Lui

The post An In-depth Analysis of the AWS S3 Outage Impact appeared first on Catchpoint's Blog.

