
7 Steps to Avoiding Downtime

Ensure High Availability for Your Applications With These 7 Steps

Several months ago, Delta experienced an IT outage that cost the company over $150 million, dropping its overall profit margins by up to 3%. Customers were stranded for hours, 2,300 flights were cancelled, and Delta had to pay for thousands of hotel and travel vouchers to compensate for the extended outage, even though the incident likely still caused some customers to churn permanently [1].

Downtime can strike at any moment, even for the applications and services of multi-million dollar brands, and a single extended outage can cost a business hundreds of millions of dollars. Situations like these can be largely avoided if you follow these steps:

  1. Adopt a microservices architecture
    Traditionally, applications were developed in the monolithic style, with the entire app built as one piece. Today, microservices architectures are becoming increasingly popular. They involve developing, testing, and deploying an application as smaller parts that are not entirely dependent on each other. This makes maintenance much easier because the components of the application are isolated from each other. If one component fails, it can be targeted and fixed separately without affecting the others. In a monolithic application, a single fault can take down the entire app, and it’s difficult to find exactly what went wrong. A microservices approach makes your app more resilient to downtime, and is the first step to achieving high availability. However, be aware that microservices architectures introduce far more complexity and increase the volume of monitoring data generated, so it’s critical to be able to correlate related alerts and suppress non-actionable alerts to reduce overall noise.
  2. Make releases faster and more frequent
    A major benefit of a microservices architecture is that it enables faster releases: multiple times a day for web apps, and bi-weekly for mobile apps. The old approach was to ship a major release every quarter or so, and downtime was inevitable with every release. With the modern approach, releases are broken into small pieces. Deployments are rolled out in the background to only portions of the application at any one time, so the platform always remains up and running. This not only reduces the risk of downtime, it also makes you more competitive, because higher release velocity lets you deliver more cutting-edge features and value.
  3. Treat availability as a quality issue
    Quality and availability go together. Many organizations fail to see the importance of QA, to the point of neglecting it until the last minute. To prevent buggy software, the QA team must be involved as early as possible in the development process and stay closely involved in the release lifecycle. QA should focus its efforts on automation and testing strategy. A test automation framework can help minimize errors while reducing cost and saving time compared with a manual approach. Additionally, testers should not just look for bugs; they should also be proactively engaged in the requirements process to help steer development in the proper direction. When QA helps the development team build the right way from the beginning, the organization is less likely to accumulate technical debt later. QA is about constant improvement, and your incentives should target that goal.
  4. Have a disaster recovery plan
    When core services in your app are disrupted, it is a disaster. In these situations, you need a good disaster recovery plan. With most organizations using hybrid architectures that span public and private cloud infrastructure, it’s important to have redundancy across your servers and to keep backups with different providers. Virtualization can be very useful for making an image backup of an existing physical server, and containerization even more so, because container images are typically far more lightweight and take up less space. Strategies such as these help ensure your data is available even in a time of disaster. Going further, you need to automate your backup plan end-to-end, so that recovery doesn’t wait on an administrator’s permission when that administrator isn’t available. Automation also allows your DevOps team to easily test the disaster recovery plan and be ready for any disaster that may come their way.
  5. Employ ITSM change management
    Use a standardized framework like ITIL for ITSM change management. Changes are essential to IT services, since there would be no progress without them, but every change must be documented. Measure change success rates and publish the results to identify which teams have a low change success rate. An ITSM tool like ServiceNow provides more visibility and control over change management, helping you make changes quickly, efficiently, and with minimal disruption to IT services.
  6. Use an incident management tool
    When the inevitable downtime does happen, it’s critical to inform the right people on the team in real time. But teams often get too many alerts and can miss the really important ones, which increases mean time to resolution (MTTR). An incident management platform like PagerDuty helps manage and group alerts from different monitoring systems and will prove invaluable during an outage. It suppresses non-actionable alerts based on easily defined rules, groups related actionable alerts into incidents, and ensures only the high-priority incidents trigger a notification to the right people, with the right context. Further, through integrations with your existing monitoring, ticketing, ChatOps, and collaboration tools, PagerDuty equips your team to troubleshoot and resolve incidents quickly so your app is up and running as much as possible.
  7. Deliberately induce failures
    Planned failure helps ensure your team is always prepared to resolve downtime. Netflix is well known for taking this approach. It uses a tool called Chaos Monkey that constantly runs in the background and randomly shuts down server instances. This keeps the team prepared for real server failures while customers continue to be served smoothly. PagerDuty also practices Failure Fridays every week, purposely injecting failure into the system to continuously improve response, ensure preparedness, and maximize reliability.
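
To make the idea in step 7 concrete, here is a minimal sketch of a failure-injection drill, written in Python. It is not Chaos Monkey and does not touch real servers: `Instance`, `ServicePool`, `inject_failure`, and `run_experiment` are hypothetical names, and the pool is an in-memory stand-in for a group of redundant servers. The `min_healthy` threshold is an assumed definition of "still able to serve customers".

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Instance:
    name: str
    healthy: bool = True


@dataclass
class ServicePool:
    """Hypothetical in-memory stand-in for a group of redundant servers."""
    instances: List[Instance] = field(default_factory=list)
    min_healthy: int = 1  # assumed threshold for "still serving traffic"

    def healthy_instances(self) -> List[Instance]:
        return [i for i in self.instances if i.healthy]

    def is_available(self) -> bool:
        return len(self.healthy_instances()) >= self.min_healthy


def inject_failure(pool: ServicePool, rng: random.Random) -> Optional[Instance]:
    """Mark one randomly chosen healthy instance as failed and return it."""
    candidates = pool.healthy_instances()
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.healthy = False
    return victim


def run_experiment(pool: ServicePool, rounds: int, seed: int = 0) -> List[Tuple[str, bool]]:
    """Inject one failure per round; record (victim name, pool still available)."""
    rng = random.Random(seed)
    results: List[Tuple[str, bool]] = []
    for _ in range(rounds):
        victim = inject_failure(pool, rng)
        if victim is None:  # nothing left to shut down
            break
        results.append((victim.name, pool.is_available()))
    return results


if __name__ == "__main__":
    pool = ServicePool(
        instances=[Instance("web-1"), Instance("web-2"), Instance("web-3")],
        min_healthy=2,
    )
    for name, available in run_experiment(pool, rounds=3):
        print(f"shut down {name}: service available = {available}")
```

In a real drill, the interesting output is the round in which availability first drops, because it shows how much redundancy the service actually has before customers are affected.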

Although achieving perfection is impossible, focusing on the people, processes, and tools that make up your DevOps team will bring you close. There isn’t a silver bullet that will eliminate all your downtime issues, but as you follow these steps, you’ll build apps that are more reliable, and earn and keep the trust and loyalty of your customers.
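
The alert handling described in step 6 (suppress non-actionable alerts by rule, group related alerts into incidents, and notify people only for high-priority incidents) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PagerDuty's API or implementation: `Alert`, `triage`, and `should_page` are hypothetical names, alerts are grouped by service only, and "high priority" is assumed to mean the incident contains at least one `critical` alert.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str  # assumed values: "info", "warning", "critical"
    message: str


# A suppression rule returns True when an alert is non-actionable.
SuppressionRule = Callable[[Alert], bool]


def triage(alerts: List[Alert], rules: List[SuppressionRule]) -> Dict[str, List[Alert]]:
    """Drop alerts matched by any rule, then group the rest by service."""
    incidents: Dict[str, List[Alert]] = defaultdict(list)
    for alert in alerts:
        if any(rule(alert) for rule in rules):
            continue  # suppressed: never reaches a human
        incidents[alert.service].append(alert)
    return dict(incidents)


def should_page(incident: List[Alert]) -> bool:
    """Notify someone only if the incident contains a critical alert."""
    return any(a.severity == "critical" for a in incident)


if __name__ == "__main__":
    alerts = [
        Alert("checkout", "critical", "5xx rate above threshold"),
        Alert("checkout", "warning", "latency rising"),
        Alert("search", "info", "deploy finished"),
        Alert("search", "warning", "disk at 80%"),
    ]
    rules = [lambda a: a.severity == "info"]  # assumed rule: info is noise
    for service, incident in triage(alerts, rules).items():
        print(f"{service}: {len(incident)} alert(s), page={should_page(incident)}")
    # prints:
    # checkout: 2 alert(s), page=True
    # search: 1 alert(s), page=False
```

Four raw alerts become two incidents and a single page, which is the noise reduction the step is after. Real platforms typically group on richer signals than the service name alone.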

    [1] Gensler, Lauren. “Delta’s Computer Outage To Cost Them $150 Million.” Forbes. Forbes Magazine, 07 Sept. 2016. Web. 13 Feb. 2017.


    The post 7 Steps to Avoiding Downtime appeared first on PagerDuty.


More Stories By PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and GitHub.
