7 Steps to Avoiding Downtime

Ensure High Availability for Your Applications With These 7 Steps

In August 2016, Delta experienced an IT outage that cost the airline more than $150 million, reducing its overall profit margins by up to 3%. Customers were stranded for hours, 2,300 flights were cancelled, and Delta had to pay for thousands of hotel and travel vouchers to compensate for the extended outage. On top of that, the incident likely drove some customers away for good [1].

Downtime can strike the applications and services of even multi-million-dollar brands at any moment, and a single extended outage can cost a business hundreds of millions of dollars. Situations like these can largely be avoided if you follow these steps:

  1. Adopt a microservices architecture
    Traditionally, applications were developed in the monolithic style, with the entire app built as a single unit. Today, microservices architectures are increasingly popular: the application is developed, tested, and deployed as smaller parts that do not depend entirely on one another. Because the components are isolated, maintenance is much easier; if one component fails, it can be targeted and fixed separately without affecting the others. In a monolithic application, a failure in one part can take down the entire app, and it is difficult to pinpoint exactly what went wrong. A microservices approach therefore makes your app more resilient to downtime and is the first step toward high availability. Be aware, however, that microservices architectures add considerable complexity and generate far more monitoring data, so it is critical to be able to correlate related alerts and suppress non-actionable ones to reduce overall noise.
  2. Make releases faster, and more frequent
    The biggest benefit of a microservices architecture is that it enables faster releases: multiple times a day for web apps, and bi-weekly for mobile apps. The old model was a major release every quarter or so, with downtime inevitable at every release. With the modern approach, releases are incremental. Each deployment is rolled out to only part of the application at a time, in the background, so the platform stays up and running throughout. This not only reduces the risk of downtime but also makes you more competitive, because higher release velocity lets you deliver new features and value sooner.
  3. Availability is a quality issue
    Quality and availability go together. Many organizations underestimate QA, to the point of neglecting it until the last minute. To prevent buggy software, the QA team must be involved as early as possible in development and integrated tightly into the release lifecycle. QA should focus its efforts on test automation and an overall testing strategy. Compared with a manual approach, a test automation framework can help minimize errors while dramatically reducing cost and time. Testers also do more than look for bugs; they must be proactively engaged in the requirements process to help steer development in the right direction. By helping the development team build things the right way from the beginning, they reduce the technical debt the organization accumulates later. QA is about constant improvement, and your incentives should target that goal.
  4. Have a disaster recovery plan
    When core services in your app are disrupted, it is a disaster, and you need a good disaster recovery plan. With most organizations running hybrid architectures that span public and private cloud infrastructure, it is important to have redundancy across your servers and to make backups across different providers. Virtualization is useful for making an image backup of an existing physical server, and containerization even more so, because container images are far more lightweight and take up less space. Strategies such as these help ensure your data is available even in a disaster. Going further, you need to automate your backup plan end-to-end so that it does not depend on an administrator's approval, especially when that administrator is unavailable. Automation also makes it easy for your DevOps team to test the disaster recovery plan and be ready for any disaster that comes their way.
  5. Employ ITSM change management
    Use a standardized framework such as ITIL for IT service management (ITSM) change management. Change is essential to IT services, since without it there would be no progress, but every change must be documented. Measure change success rates and publish the results so you can identify which teams have low success rates. An ITSM tool such as ServiceNow gives you more visibility into and control over change management, helping you make changes quickly, efficiently, and with minimal disruption to IT services.
  6. Use an incident management tool
    When downtime inevitably happens, it is critical to notify the right people on the team in real time. Often, though, teams receive so many alerts that they miss the really important ones, which increases mean time to resolution (MTTR). An incident management platform like PagerDuty helps manage and group alerts from different monitoring systems, which proves invaluable during an outage. It suppresses non-actionable alerts based on easily defined rules, groups related actionable alerts into incidents, and ensures that only high-priority incidents trigger a notification to the right people, with the right context. And because it integrates with your existing monitoring, ticketing, ChatOps, and collaboration tools, PagerDuty equips your team to troubleshoot and resolve incidents quickly, keeping your app up and running as much as possible.
  7. Deliberately induce failures
    Planned failure keeps your team prepared to resolve downtime. Netflix is well known for this approach: its tool Chaos Monkey runs in the background and randomly shuts down server instances. This keeps the team prepared for real server failures while customers continue to be served smoothly. PagerDuty likewise holds Failure Fridays every week, deliberately injecting failure into its systems to continuously improve response, ensure preparedness, and maximize reliability.

Although perfection is impossible, focusing on the people, processes, and tools that make up your DevOps team will bring you close. There is no silver bullet that will eliminate all of your downtime issues, but by following these steps you will build more reliable apps and earn and keep the trust and loyalty of your customers.
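Step 1 says that in a microservices architecture a failing component can be isolated without taking down the rest of the app. One common way to get that behavior, graceful degradation with a fallback value, can be sketched in Python as follows. The names `call_with_fallback`, `recommendations_service`, and `render_product_page` are hypothetical, and this is an illustration rather than a production pattern.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(call: Callable[[], T], fallback: T) -> T:
    """Invoke a downstream component; if it fails, return a fallback value
    instead of letting the failure propagate to the whole application."""
    try:
        return call()
    except Exception:
        # The failure is contained here. A real system would catch narrower
        # exception types, add timeouts, and log the error.
        return fallback


def recommendations_service() -> list[str]:
    # Hypothetical component that is currently down.
    raise ConnectionError("recommendations service unavailable")


def render_product_page(product: str) -> dict:
    # The page still renders; only the optional recommendations panel degrades.
    recs = call_with_fallback(recommendations_service, fallback=[])
    return {"product": product, "recommendations": recs}
```

Here `render_product_page("book")` still returns a usable page, `{"product": "book", "recommendations": []}`, even though the recommendations component is failing.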
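Step 2 describes deployments that touch only part of the application at a time so the platform stays up. A minimal sketch of such a rolling deployment, assuming a simple in-memory fleet (`Instance`, `rolling_deploy`, and the `is_healthy` callback are hypothetical, not a real deployment tool's API), upgrades one fixed-size batch, checks its health, and halts before touching the remaining instances if anything looks wrong:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Instance:
    name: str
    version: str


def rolling_deploy(
    instances: list[Instance],
    new_version: str,
    batch_size: int,
    is_healthy: Callable[[Instance], bool],
) -> list[list[str]]:
    """Upgrade instances one batch at a time so the others keep serving.

    Returns the instance names upgraded in each batch. Raises as soon as an
    upgraded instance fails its health check, leaving later batches on the
    old version.
    """
    if not 1 <= batch_size < len(instances):
        raise ValueError("batch_size must leave at least one instance serving")
    batches: list[list[str]] = []
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        for inst in batch:
            inst.version = new_version  # stand-in for the real deploy step
        unhealthy = [inst.name for inst in batch if not is_healthy(inst)]
        if unhealthy:
            raise RuntimeError(f"rollout halted; unhealthy: {unhealthy}")
        batches.append([inst.name for inst in batch])
    return batches
```

With five instances named `web-0` through `web-4` and `batch_size=2`, the rollout proceeds in batches `["web-0", "web-1"]`, `["web-2", "web-3"]`, `["web-4"]`, so no more than two instances are being changed at any moment. Halting on the first unhealthy batch limits the blast radius of a bad release.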
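Step 3 recommends test automation over a manual approach. As a small illustration using Python's standard `unittest` module, here is an automated regression test for a hypothetical configuration helper, `parse_timeout`. Tests like these can run on every commit, so bugs surface before a release instead of in production:

```python
import unittest


def parse_timeout(value: str) -> float:
    """Hypothetical config helper: parse '30s' or '500ms' into seconds."""
    # Check "ms" before "s", because "500ms" also ends with "s".
    if value.endswith("ms"):
        return int(value[:-2]) / 1000
    if value.endswith("s"):
        return float(value[:-1])
    raise ValueError(f"unrecognized timeout: {value!r}")


class ParseTimeoutTest(unittest.TestCase):
    def test_seconds(self):
        self.assertEqual(parse_timeout("30s"), 30.0)

    def test_milliseconds(self):
        self.assertEqual(parse_timeout("500ms"), 0.5)

    def test_rejects_unknown_unit(self):
        with self.assertRaises(ValueError):
            parse_timeout("5m")
```

Saved to a file, the suite runs with `python -m unittest` followed by the module name. The suffix ordering inside `parse_timeout` is exactly the kind of detail that a regression test protects when someone later refactors the code.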
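Step 4 calls for backups across different providers and an automated recovery plan that is easy to test. The sketch below assumes each provider can be represented by a directory, which is a simplification, since real providers would be separate storage services. It shows two pieces: copying a backup to every destination with a SHA-256 integrity check, and a small automated restore drill. All function names are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, destinations: list[Path]) -> dict[str, str]:
    """Copy `source` into every destination and confirm each copy is intact.

    Each destination stands in for a different provider. Raises OSError if
    any copy's checksum differs from the original.
    """
    expected = sha256_of(source)
    results: dict[str, str] = {}
    for dest_dir in destinations:
        dest_dir.mkdir(parents=True, exist_ok=True)
        copy = dest_dir / source.name
        shutil.copy2(source, copy)
        if sha256_of(copy) != expected:
            raise OSError(f"backup at {copy} failed verification")
        results[str(dest_dir)] = expected
    return results


def restore_drill() -> bool:
    """Hypothetical DR drill: back up a sample file to two 'providers',
    delete the original, restore it from the second copy, and compare."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        original = root / "data.db"
        original.write_bytes(b"orders: 42\n")
        providers = [root / "provider-a", root / "provider-b"]
        backup_and_verify(original, providers)
        original.unlink()  # simulate losing the primary copy
        shutil.copy2(providers[1] / original.name, original)
        return original.read_bytes() == b"orders: 42\n"
```

Running a drill like `restore_drill()` on a schedule, rather than only after an outage, is what turns a backup plan into a tested recovery plan.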
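Step 5 suggests measuring change success rates and publishing them to find teams that are struggling. Assuming change records can be reduced to `(team, succeeded)` pairs (a hypothetical format, not an ITIL or ServiceNow export), the calculation is a simple ratio per team:

```python
from collections import defaultdict


def change_success_rates(changes: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute each team's change success rate from (team, succeeded) records."""
    totals: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for team, succeeded in changes:
        totals[team] += 1
        if succeeded:
            successes[team] += 1
    return {team: successes[team] / totals[team] for team in totals}


def low_performers(rates: dict[str, float], threshold: float = 0.9) -> list[str]:
    """Return teams whose success rate falls below the threshold, sorted by name."""
    return sorted(team for team, rate in rates.items() if rate < threshold)
```

With the records `("payments", True)`, `("payments", False)`, `("search", True)`, and `("search", True)`, the rates are 0.5 for payments and 1.0 for search, so `low_performers` flags `["payments"]`. The 0.9 threshold is an arbitrary example; choose one that matches your own baseline.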
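Step 6 describes suppressing non-actionable alerts, grouping related actionable alerts into incidents, and notifying people only for high-priority incidents. The following is a simplified sketch of those three rules, assuming each alert carries a service name and a severity. It is not PagerDuty's API or rule engine; the field names, severity levels, and grouping-by-service rule are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str    # monitoring system that raised the alert
    service: str   # affected service
    severity: str  # "info", "warning", or "critical"
    message: str


@dataclass
class Incident:
    service: str
    alerts: list[Alert] = field(default_factory=list)

    @property
    def notify(self) -> bool:
        # Only page a human when at least one grouped alert is critical.
        return any(a.severity == "critical" for a in self.alerts)


def triage(alerts: list[Alert], suppressed: frozenset[str] = frozenset({"info"})) -> list[Incident]:
    """Drop non-actionable alerts, then group the rest into one incident per service."""
    incidents: dict[str, Incident] = {}
    for alert in alerts:
        if alert.severity in suppressed:
            continue  # suppression rule: these alerts never reach a human
        if alert.service not in incidents:
            incidents[alert.service] = Incident(alert.service)
        incidents[alert.service].alerts.append(alert)
    return list(incidents.values())
```

Given a critical and a warning alert for `checkout`, plus an info and a warning alert for `search`, `triage` drops the info alert and returns two incidents. Only the `checkout` incident has `notify` set, so only it would page someone.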
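Step 7 describes Chaos Monkey randomly shutting down server instances. A toy version of one chaos round, assuming a fleet modeled as a dictionary mapping instance names to a running flag, might look like the following. It is not Netflix's implementation, and the `min_healthy` guard is an assumption made for this sketch, not a documented Chaos Monkey setting.

```python
import random
from typing import Optional


def chaos_round(instances: dict[str, bool], rng: random.Random, min_healthy: int = 1) -> Optional[str]:
    """Terminate one randomly chosen running instance and return its name.

    `instances` maps instance name -> running flag. Returns None instead of
    terminating anything if that would leave fewer than `min_healthy`
    instances running.
    """
    running = sorted(name for name, up in instances.items() if up)
    if len(running) <= min_healthy:
        return None  # too few instances left; skip this round
    victim = rng.choice(running)
    instances[victim] = False  # stand-in for a real termination API call
    return victim
```

In a real exercise, each termination would be followed by checking that the service kept responding and that the right alerts fired, which is the point of a Failure Friday-style drill. Passing in a seeded `random.Random` makes a given drill reproducible.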





    [1] Gensler, Lauren. “Delta’s Computer Outage To Cost Them $150 Million.” Forbes. Forbes Magazine, 7 Sept. 2016. Web. 13 Feb. 2017.

    The post 7 Steps to Avoiding Downtime appeared first on PagerDuty.


About PagerDuty Blog

PagerDuty’s operations performance platform helps companies increase reliability. By connecting people, systems and data in a single view, PagerDuty delivers visibility and actionable intelligence across global operations for effective incident resolution management. PagerDuty has over 100 platform partners, and is trusted by Fortune 500 companies and startups alike, including Microsoft, National Instruments, Electronic Arts, Adobe, Rackspace, Etsy, Square and GitHub.
