By Don MacVittie
May 26, 2016 10:47 AM EDT
If you’ve ever stood in the ruins of what was once your datacenter and pondered how much work you had to do and how little time you had to do it, then you probably nodded at the title. If you have ever worked to get as much data off of a massive RAID array as possible with blinking red lights telling you that your backups should have been better, you too probably nodded at the title.
It is true that I have experienced both of these situations. A totally flooded datacenter (caused by broken pipes in the ceiling) set us to scrambling so that we could put together something while we restored normal function. The water had come through the ceiling and was a couple of feet deep, so the destruction was pretty thorough. In a different role, a RAID array with many disks lost one disk, and before our service contractor came to replace that disk (less than 24 hours), two more went. Eventually, as more people than just us had problems, the entire batch of disks that this RAID device’s drives came out of was determined to be faulty. Thing is, a ton of our operational intelligence was on those disks in the form of integrations – the system this array served knitted together a dozen or so systems on several OS/DB combinations, and all the integration code was stored on the array. The system was essentially the cash register of the organization, so downtime was not an option. And I was the manager responsible.
Both of these scenarios came about before DevOps was popular, and in both scenarios we had taken reasonable precautions. But when the fire was burning and the clock was ticking, our reasonable precautions weren’t good enough to get us up and running (even minimally) in a short amount of time. And that “minimally” is massively important. In the flood scenario, the vast majority of our hardware was unusable, and in a highly dynamic environment, some of our code – and even purchased packages – was not in the most recent set of backups. That last bit was true with the RAID array also. We were building something that had never been done before at the scale we were working on, so change was constant, new data inputs were constant, and backups – like most backups – were not continuous.
With DevOps, these types of disasters are still an issue, and some of the problems we had will still have to be dealt with. But one of the big issues we ran into – getting new hardware, getting it installed, getting apps on it, and getting it running so customers/users could access something – is largely taken care of.
With provisioning – server and app – and continuous integration, the environment you need can be recreated in a short amount of time, assuming you can get hardware to run it on, or can run it hosted or in the cloud for the near term.
Assuming that you are following DevOps practices (I’d say “best practices”, but this is really more fundamental than that), you have configuration and reinstall information in GitHub or Bitbucket or something similar. So getting some of your services back online becomes a case of downloading and installing a tool like Stacki or Cobbler, hooking it to a tool like Puppet or SaltStack, and getting your configuration files down to start deploying servers from RAID to app.
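To make that “minimally” concrete, here is a small sketch of the kind of decision a recovery script could make before any provisioning tool is invoked: given a manifest kept in the same repository as your configuration, work out which services must come back first and in what order. The manifest format, the service names, and the `minimal_recovery_order` function are all hypothetical, invented for illustration; nothing here calls Stacki, Cobbler, Puppet, or SaltStack, and a real setup would hand this ordering to whichever of those tools you use.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical manifest. In practice this would be a file checked in
# next to your provisioning and configuration data.
SERVICES = {
    "database": {"depends_on": [], "critical": True},
    "integration-layer": {"depends_on": ["database"], "critical": True},
    "cash-register": {"depends_on": ["integration-layer"], "critical": True},
    "reporting": {"depends_on": ["database"], "critical": False},
}


def minimal_recovery_order(services):
    """Return the critical services, plus everything they depend on,
    ordered so that each dependency comes before its dependents."""
    # Walk outward from the critical services to collect every
    # transitive dependency; non-critical leftovers are skipped.
    needed = set()
    stack = [name for name, spec in services.items() if spec["critical"]]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        stack.extend(services[name]["depends_on"])

    # TopologicalSorter expects {node: set_of_predecessors}.
    graph = {name: set(services[name]["depends_on"]) for name in needed}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step, name in enumerate(minimal_recovery_order(SERVICES), start=1):
        print(f"{step}. bring up {name}")
```

The point of the design is that the “what comes first” knowledge lives in version control alongside everything else, so it survives the flood or the dead array along with the rest of your configuration.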
Will it be perfect? Not at all. If your organization has gone all-in and has network configuration information in a tool like Puppet with the Cisco or F5 plugins, for example, it is highly unlikely that the short-term network gear you use while you work things out with an insurance company is going to be configurable by that information. But having DevOps in place will save you a lot of time, because you don’t have to rebuild everything by hand.
And trust me, at that instant, the number one thing you will care about is “How fast can I get things going again?” knowing full well that the answer to that question will be temporary while the real problems are dealt with. Anything that can make that process easier will help. You will already be stressed trying to get someone – be it vendor reps for faulty disk drives or insurance reps for disasters – out to help you do the longer-term recovery, so the short term should be as automatic as possible.
Surprisingly, I can’t say “I hope you never have to deal with this”. It is part of life in IT, and I honestly learned a ton from it. Losing a few thousand lines of code and tens of thousands of billable data with the RAID issue was an expensive lesson, but we came out stronger and more resilient. The flooded datacenter gave me a chance to deal with insurance on a scale most people never have to, and (with the help of the other team members of course) to build a shiny new datacenter from the ground up – something we all want to do. But if you have a choice, avoid it. Since you don’t generally have a choice, prepare for it. DevOps is one way of preparing.