
7 Steps to Avoiding Downtime

Ensure High Availability for Your Applications With These 7 Steps

Several months ago, Delta experienced an IT outage that cost them over $150 million, dropping their overall profit margins by up to 3%. Customers were stranded for hours, 2,300 flights were cancelled, and Delta had to pay for thousands of hotel and travel vouchers to compensate for the extended outage. Even with that compensation, the incident likely caused some customers to churn permanently [1].

Downtime can strike at any moment, even for applications and services from multi-million dollar brands, and a single extended outage can cost a business hundreds of millions of dollars. But situations such as these can be largely avoided if you follow these steps:

  1. Adopt a microservices architecture
    Traditionally, applications were developed in the monolithic style, with the entire app built as one piece. Today, microservices architectures are becoming increasingly popular. They involve developing, testing, and deploying an application as smaller parts that are not entirely dependent on each other. This makes maintenance much easier because the components of the application are isolated from each other. So, if one particular component fails, it can be targeted and fixed separately without affecting other components. In a monolithic application, if something goes wrong, the entire app experiences downtime and it’s difficult to find exactly what went wrong. A microservices approach makes your app more resilient to downtime, and is the first step to achieving high availability. However, be aware that microservices architectures introduce far more complexity and increase the volume of monitoring data generated, so it’s critical to be able to correlate related alerts and suppress non-actionable alerts to reduce overall noise.
  2. Make releases faster, and more frequent
    The biggest benefit of a microservices architecture is that it enables faster releases—multiple times a day for web apps, and bi-weekly for mobile apps. The old order was to have major releases every quarter or so, and downtime was inevitable with every release. With the modern approach, releases are fragmented. Deployments are rolled out to only portions of the application in the background at any one time so that the platform always remains up and running. This not only reduces the risk of downtime, it makes you more competitive as you increase your release velocity to deliver more cutting-edge features and value.
  3. Availability is a quality issue
    Quality and availability go together. A lot of organizations fail to see the importance of QA, to the point of neglecting it until the last minute. To prevent buggy software, the QA team must be involved as early as possible in the development process and tightly involved in the release lifecycle. QA should focus their efforts on automation and testing strategy. A test automation framework can help minimize errors while dramatically reducing costs and saving time in comparison with a manual approach. Additionally, testers do not just look for bugs; they must also be proactively engaged in the requirements process to help steer development in the proper direction. By helping to make sure the development team is building the right way from the beginning, the organization is less likely to have as much technical debt in the future. QA is about constant improvement, and your incentives should target that goal.
  4. Have a disaster recovery plan
    When core services in your app are disrupted, it is a disaster. In these situations, you need a good disaster recovery plan. With most organizations using hybrid architectures with both public and private cloud infrastructure, it’s important to have redundancy across your servers and make backups across different providers. Virtualization can be really useful when making an image backup of an existing physical server, and containerization even more so because the image backups are far more lightweight and take up less space. Strategies such as these help ensure your data is available even in a time of disaster. Going further, you need to automate your backup plan end-to-end, so it doesn’t depend on an administrator’s permission, especially if they aren’t available. Automation also allows your DevOps team to easily test the disaster recovery plan, and be ready for any disaster that may come their way.
  5. Employ ITSM change management
    Make sure standardized frameworks like ITIL are used for ITSM change management. Changes are essential to IT services, because without them there wouldn’t be progress, but every change must be documented. Measure change success rates and publish the results to identify which teams have a low change success rate. An ITSM tool like ServiceNow gives you more visibility and control over change management. It allows you to make changes quickly, efficiently and with minimal disruption to IT services.
  6. Use an incident management tool
    When inevitable downtime does happen, it’s critical to inform the right people on the team in real-time. But often, teams get too many alerts and miss the really important ones, which increases mean time to resolution (MTTR). An incident management platform like PagerDuty helps manage and group alerts from different monitoring systems and will prove invaluable during an outage. It suppresses non-actionable alerts based on easily defined rules, groups related actionable alerts into incidents, and ensures only the high-priority incidents trigger a notification to the right people, with the right context. Further, by integrating with your existing monitoring, ticketing, ChatOps, and collaboration tools, PagerDuty equips your team to troubleshoot and resolve incidents quickly so your app is up and running as much as possible.
  7. Deliberately induce failures
    Planned failure helps ensure your team is prepared to resolve any downtime. Netflix is well known for taking this approach. They use a tool called Chaos Monkey that runs in the background and randomly shuts down server instances. This helps the team stay prepared for real server failures, while serving their customers smoothly at the same time. PagerDuty also practices Failure Fridays every week, purposely injecting failure into the system to continuously improve response, ensure preparedness, and maximize reliability.
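
To make the idea in step 7 concrete, here is a minimal sketch of random failure injection against a simulated fleet. It is not Netflix's Chaos Monkey and does not call any real cloud API: the `Instance` class, the `inject_failure` and `service_available` helpers, the `min_healthy` guard, and the `web-N` names are all hypothetical, chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A hypothetical server instance in a simulated fleet."""
    name: str
    healthy: bool = True


def inject_failure(fleet, rng, min_healthy=1):
    """Stop one random healthy instance, unless doing so would leave
    fewer than `min_healthy` running. Returns the victim, or None."""
    healthy = [inst for inst in fleet if inst.healthy]
    if len(healthy) <= min_healthy:
        return None  # safety guard: never take the whole service down
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim


def service_available(fleet):
    """The service counts as up while at least one instance is healthy."""
    return any(inst.healthy for inst in fleet)


fleet = [Instance(f"web-{i}") for i in range(4)]
rng = random.Random(42)  # seeded so a drill can be replayed

for round_number in range(1, 6):
    victim = inject_failure(fleet, rng)
    label = victim.name if victim else "nobody (guard tripped)"
    print(f"round {round_number}: stopped {label}, "
          f"available={service_available(fleet)}")
```

The guard reflects a design choice rather than a rule: a drill like this is meant to exercise the response process, so the sketch refuses to remove the last healthy instance. A real exercise would also involve restoring the stopped instances and reviewing how the team responded.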

Although achieving perfection is impossible, focusing on the people, processes, and tools that make up your DevOps team will bring you close. There isn’t a silver bullet that will eliminate all your downtime issues, but as you follow these steps, you’ll build apps that are more reliable, and earn and keep the trust and loyalty of your customers.
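
As a rough illustration of the alert handling described in steps 1 and 6 (suppress non-actionable alerts, group related ones into incidents, notify only on high priority), the flow might look something like the sketch below. This is not PagerDuty's API or algorithm. The `Alert` type, the severity labels, the "suppress info-level alerts" rule, and the "group by service" rule are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    summary: str
    severity: str  # assumed labels: "info", "warning", "critical"


def is_actionable(alert):
    """Hypothetical suppression rule: drop purely informational alerts."""
    return alert.severity != "info"


def group_into_incidents(alerts):
    """Group actionable alerts by service, one incident per service."""
    incidents = defaultdict(list)
    for alert in alerts:
        if is_actionable(alert):
            incidents[alert.service].append(alert)
    return dict(incidents)


def incidents_to_notify(incidents):
    """Notify only for incidents containing at least one critical alert."""
    return sorted(
        service
        for service, grouped in incidents.items()
        if any(a.severity == "critical" for a in grouped)
    )


alerts = [
    Alert("checkout", "error rate above 5%", "critical"),
    Alert("checkout", "latency p99 above 2s", "warning"),
    Alert("search", "disk usage at 70%", "warning"),
    Alert("search", "nightly reindex finished", "info"),
]

incidents = group_into_incidents(alerts)
print({service: len(grouped) for service, grouped in incidents.items()})
# prints {'checkout': 2, 'search': 1}
print(incidents_to_notify(incidents))
# prints ['checkout']
```

Four raw alerts collapse into two incidents, and only one of them pages anyone. Real correlation rules are usually richer than "same service" (time windows, dependency graphs, deduplication keys), so treat the grouping key here as a placeholder.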

    [1] Gensler, Lauren. “Delta’s Computer Outage To Cost Them $150 Million.” Forbes. Forbes Magazine, 07 Sept. 2016. Web. 13 Feb. 2017.


    The post 7 Steps to Avoiding Downtime appeared first on PagerDuty.


More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and Github.
