Welcome!

Blog Feed Post

This Is How Amazon’s Servers Rarely Go Down [Infographic]

Amazon Web Services (AWS), Amazon’s best-in-class cloud services offering, had downtime of only 2.5 hours in 2015. You may think their uptime of 99.9997 percent had something to do with an engineering team of hundreds, a budget of billions, or dozens of data centers across the globe—but you’d be wrong. Amazon’s website, video, and music offerings, and even AWS itself, all leverage multiple AWS products to get five nines of availability, and those are the same products we get to use as consumers. With some clever engineering and good service decisions, anyone can get uptime numbers close to Amazon’s for only a fraction of the cost.

But before we discuss specific techniques to keep your site constantly available, we need to accept a difficult reality: Downtime is inevitable. Even Google was offline in 2015, and if the single largest website can’t get 100 percent uptime, you can be sure it’s impossible for your company to do so too. Instead of trying to prevent downtime, reframe your thinking and do everything you can to make sure your service is as usable as possible even while failure occurs, and then recover from it as quickly as possible.

Here’s how to architect an application to isolate failure, recover rapidly from downtime, and scale in the face of heavy load. (Though this is only a brief overview: there are plenty of great resources online for more detailed descriptions. For example, don’t be afraid to dive into your cloud provider’s documentation. It’s the single best source for discovering all the amazing things they can do for you.)

 

Architecture and Failure Mitigation

Let’s begin by considering your current web application. If your primary database were to go down, how many services would be affected? Would your site be usable at all? How quickly would customers notice?

If your answers are “everything,” “not at all,” and “immediately” you may want to consider a more distributed, failure-resistant application architecture. Microservices—that is, many different, small applications that work together to act like a larger app—are extremely popular as an engineering paradigm. The failure of an individual service is less noticeable to all clients.

For example, consider a basic shop application. If it were all one big service, failure of the database takes the entire site offline; no one can use it at all, even just to browse products or plan purchases. But now let’s say you have microservices instead of a monolith. Instead of a single shop application, perhaps you have an authentication service to login users, a product service to browse the shop, and an order fulfillment service to charge customers and ship goods. A failure in the order fulfillment database means that only people who try to ship see errors.

Losing an element of your operation isn’t ideal, but it’s not anywhere near as bad as having your entire site unavailable. Only a small fraction of customers will be affected, while everyone else can happily browse your store as if nothing was going wrong. And with proper logging, you can note the prospects that had failed requests and reach out to them personally afterward, apologizing for the downtime and hopefully still converting them into paying customers.

This is all possible with a monolithic app, but microservices distribute failure and better isolate it to specific parts of a system. You won’t prevent downtime; instead, you’ll make it affect less people, which is a much more achievable goal.

Databases, Automatic Failover, and Preventing Data Loss

It’s 2 a.m. and a database stops working. What happens to your website? What happens to the data in your database? How long will you be offline?

This used to be the sysadmin nightmare scenario: pray the last backup was usable and recent, downtime would only be a few hours, only a day’s worth of data perished. But nowadays the story is very different, thanks in part to Amazon but also to the power and flexibility of most database software.

If you use the AWS Relational Database Service (RDS), you get daily backups for free, and restoration of a backup is just a click away. Better yet, with a multi-availability zone database, you’re likely to have no downtime at all and the entire database failure will be invisible.

With a multi-AZ database, Amazon keeps an up-to-date copy of your database in another availability zone: a logically separate datacenter from wherever your primary database is. An internet outage, a power blip, or even a comet can take out the primary availability zone, and Amazon will detect the downtime and automatically promote the database copy to be your main database. The process is seamless and happens immediately—chances are, you won’t even experience any data loss.

But availability zones are geographically close together. All of Amazon’s us-east-1 datacenters are in Virginia, only a few miles from each other. Let’s say you also want to protect against the complete failure of all systems in the United States and keep a current copy of your data in Europe or Asia. Here, RDS offers cross-region read replicas that leverage the underlying database technology to create consistent database copies that can be promoted to full-fledged primaries at the touch of a button.

Both MySQL and PostgreSQL, the two most popular relational database systems on the market and available as RDS database drivers, offer native capabilities to ship database events to external follower databases as they occur. Here, RDS takes advantage of a feature that anyone can use, though with Amazon’s strong consumer focus, it’s significantly easier to set up in RDS than to do it manually. Typically, data is shipped to followers simultaneously to data being committed to the primary. Unfortunately, across a continent, you’re looking at a data loss window of about 200 to 500 milliseconds, because an event must be sent from your primary database and be read by the follower.

Still, for recovering a cross-continental consistent backup system, 500 milliseconds is much better than hours. So next time your database fails in the middle of the night, your monitoring service won’t even wake you. Instead you can read about it in the morning—if you can even detect that it occurred. And that means no downtime and no unhappy custom.

Auto Scaling, Repeatability, and Consistency

Amazon’s software-as-a-service (SaaS) offerings, such as RDS, are extremely convenient and very powerful. But they’re far from perfect. Generally, AWS products are much slower to provision compared to running the software directly yourself. Plus, they tend to be several software versions behind the most recent releases.

In databases, this is a fine tradeoff. You almost never create databases so slow that startup doesn’t matter, and you want extremely stable, well-tested, slightly older software. If you try to stay on the bleeding edge, you’ll just end up bloody. But for other services, being locked into Amazon’s product offerings makes less sense.

Once you have an RDS instance, you need some way for customers to get their data into it and for you to interact with that data once it’s there. Specifically, you need web servers. And while Amazon’s Elastic Beanstalk (AWS’ platform to deploy and scale web applications) is conceptually good, in practice it is extremely slow with middling software support, and can be painfully difficult to debug problems.

But AWS’ primary offering has always been the Elastic Compute Cloud (EC2). Running EC2 nodes is fast and easy, and supports any kind of software your application needs. And, unsurprisingly, EC2 offers exceptional tools to mitigate downtime and failure, including auto scaling groups (ASGs). With an ASG, Amazon keeps as many servers up as you specify, even across availability zones. If a server becomes unresponsive or passes other thresholds defined by you (such as amount of incoming traffic or CPU usage), new nodes will automatically spin up.

New servers by themselves do you no good. You need a process to make sure  new nodes are provisioned correctly and consistently so a new server joining  your auto scaling group also has your web  software and credentials to access  your database. Here, you can take  advantage of another Amazon tool, the  Amazon  Machine Image (or AMI). An AMI is a saved copy of an EC2 instance.  Using an AMI, AWS can spin up a new node  that is  an exact copy of the machine that generated the AMI.

Packer, by Hashicorp, makes it easy to create and save AMIs, and is also free and open-source. But there are lots of  amazing tools that can simplify AMI creation. They are the fundamental building blocks of EC2. With clever AMI use you’ll  be able to create new, functional servers in less than 5 minutes.

It’s common to need additional provisioning and configuration even after an AMI is started—perhaps you want to make  sure  the latest version of your application is downloaded onto your servers from GitHub, or that the most recent security  patches  have been applied to your installed packages. In cases such as these a provisioning system is a necessity. Chef  and  Puppet  are the two biggest players in this space, and both offer excellent integrations with AWS. The ideal use case  here i  is an AMI  with credentials to automatically connect to your Chef or Puppet provisioning system, which then ensures  the  newly  created node is as up to date as possible.

 

 

Final Thoughts

By relying on auto scaling groups, AMIs, and a sensible provisioning system, you can create a system that is completely  repeatable and consistent. Any server could go down and be replaced, or 10 more servers could enter your load balancer,  and the process would be seamless, automatic, and almost invisible to you.

And that’s the secret why Amazon’s services rarely go down. Not the hundreds of engineers, or dozens of datacenters, or  even the clever products: It’s the automation. Failure happens, but if you detect it early, isolate it as much as possible, and  recover from it seamlessly—all without requiring human intervention—you’ll be back on your feet before you even knew a  problem occurred.

There are plenty of potential concerns with powerful automated systems like this. How do you ensure new servers are  ones  provisioned by you, and not an attacker trying to join nodes to your cluster? How do you make sure transmitted  copies of  your databases aren’t compromised? How do you prevent a thousand nodes from accidentally starting up and dropping a  massive AWS bill into your lap? This overview of the techniques AWS leverages to prevent downtime and isolate failure  should serve as a good jumping-off point to those more complicated concepts. Ultimately, downtime is  impossible to  prevent, but you can keep it from broadly affecting your customers. Working to keep failure contained and recovery as  rapid as possible leads to a better experience both for you and your users.

 

Share “This Is How Amazon’s Servers Rarely Go Down” On Your Site

The post This Is How Amazon’s Servers Rarely Go Down [Infographic] appeared first on Application Performance Monitoring Blog | AppDynamics.

Read the original blog entry...

More Stories By Jyoti Bansal

In high-production environments where release cycles are measured in hours or minutes — not days or weeks — there's little room for mistakes and no room for confusion. Everyone has to understand what's happening, in real time, and have the means to do whatever is necessary to keep applications up and running optimally.

DevOps is a high-stakes world, but done well, it delivers the agility and performance to significantly impact business competitiveness.

Latest Stories
Your homes and cars can be automated and self-serviced. Why can't your storage? From simply asking questions to analyze and troubleshoot your infrastructure, to provisioning storage with snapshots, recovery and replication, your wildest sci-fi dream has come true. In his session at @DevOpsSummit at 20th Cloud Expo, Dan Florea, Director of Product Management at Tintri, will provide a ChatOps demo where you can talk to your storage and manage it from anywhere, through Slack and similar services ...
My team embarked on building a data lake for our sales and marketing data to better understand customer journeys. This required building a hybrid data pipeline to connect our cloud CRM with the new Hadoop Data Lake. One challenge is that IT was not in a position to provide support until we proved value and marketing did not have the experience, so we embarked on the journey ourselves within the product marketing team for our line of business within Progress. In his session at @BigDataExpo, Sum...
What sort of WebRTC based applications can we expect to see over the next year and beyond? One way to predict development trends is to see what sorts of applications startups are building. In his session at @ThingsExpo, Arin Sime, founder of WebRTC.ventures, will discuss the current and likely future trends in WebRTC application development based on real requests for custom applications from real customers, as well as other public sources of information,
SYS-CON Events announced today that SoftLayer, an IBM Company, has been named “Gold Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. SoftLayer, an IBM Company, provides cloud infrastructure as a service from a growing number of data centers and network points of presence around the world. SoftLayer’s customers range from Web startups to global enterprises.
SYS-CON Events announced today that Ocean9will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Ocean9 provides cloud services for Backup, Disaster Recovery (DRaaS) and instant Innovation, and redefines enterprise infrastructure with its cloud native subscription offerings for mission critical SAP workloads.
Interoute has announced the integration of its Global Cloud Infrastructure platform with Rancher Labs’ container management platform, Rancher. This approach enables enterprises to accelerate their digital transformation and infrastructure investments. Matthew Finnie, Interoute CTO commented “Enterprises developing and building apps in the cloud and those on a path to Digital Transformation need Digital ICT Infrastructure that allows them to build, test and deploy faster than ever before. The int...
Have you ever noticed how some IT people seem to lead successful, rewarding, and satisfying lives and careers, while others struggle? IT author and speaker Don Crawley uncovered the five principles that successful IT people use to build satisfying lives and careers and he shares them in this fast-paced, thought-provoking webinar. You'll learn the importance of striking a balance with technical skills and people skills, challenge your pre-existing ideas about IT customer service, and gain new in...
SYS-CON Events announced today that Juniper Networks (NYSE: JNPR), an industry leader in automated, scalable and secure networks, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Juniper Networks challenges the status quo with products, solutions and services that transform the economics of networking. The company co-innovates with customers and partners to deliver automated, scalable and secure network...
In his session at @ThingsExpo, Eric Lachapelle, CEO of the Professional Evaluation and Certification Board (PECB), will provide an overview of various initiatives to certifiy the security of connected devices and future trends in ensuring public trust of IoT. Eric Lachapelle is the Chief Executive Officer of the Professional Evaluation and Certification Board (PECB), an international certification body. His role is to help companies and individuals to achieve professional, accredited and worldw...
VeriStor Systems has announced that CRN has named VeriStor to its 2017 Managed Service Provider (MSP) 500 list in the Elite 150 category. This annual list recognizes North American solution providers with cutting-edge approaches to delivering managed services. Their offerings help companies navigate the complex and ever-changing landscape of IT, improve operational efficiencies, and maximize their return on IT investments. In today’s fast-paced business environments, MSPs play an important role...
DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm. In his Day 3 Keynote at 20th Cloud Expo, Chris Brown, a Solutions Marketing Manager at Nutanix, will explore t...
What if you could build a web application that could support true web-scale traffic without having to ever provision or manage a single server? Sounds magical, and it is! In his session at 20th Cloud Expo, Chris Munns, Senior Developer Advocate for Serverless Applications at Amazon Web Services, will show how to build a serverless website that scales automatically using services like AWS Lambda, Amazon API Gateway, and Amazon S3. We will review several frameworks that can help you build serverle...
SYS-CON Events announced today that Linux Academy, the foremost online Linux and cloud training platform and community, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Linux Academy was founded on the belief that providing high-quality, in-depth training should be available at an affordable price. Industry leaders in quality training, provided services, and student certification passes, its goal is to c...
SYS-CON Events announced today that Interoute, owner-operator of one of Europe's largest networks and a global cloud services platform, has been named “Bronze Sponsor” of SYS-CON's 20th Cloud Expo, which will take place on June 6-8, 2017 at the Javits Center in New York, New York. Interoute is the owner-operator of one of Europe's largest networks and a global cloud services platform which encompasses 12 data centers, 14 virtual data centers and 31 colocation centers, with connections to 195 add...
SYS-CON Events announced today that Telecom Reseller has been named “Media Sponsor” of SYS-CON's 20th International Cloud Expo, which will take place on June 6–8, 2017, at the Javits Center in New York City, NY. Telecom Reseller reports on Unified Communications, UCaaS, BPaaS for enterprise and SMBs. They report extensively on both customer premises based solutions such as IP-PBX as well as cloud based and hosted platforms.