Blog Feed Post

DNS Outage Was Doomsday for the Internet

What was supposed to be a quiet Friday suddenly turned into a real “Black Friday” for us (as well as most of the Internet) when Dyn suffered a major DDoS attack. In terms of Internet disruption, the widespread damage made this outage the worst I have ever experienced.

At the core of it all, the managed DNS provider Dyn was targeted in a DDoS attack that impacted thousands of web properties, services, SaaS providers, and more.

The chart below shows the DNS resolution time and availability of twitter.com from around the world. There were three clear waves of outages:

  • 7:10 EST to 9:10 EST
  • 11:52 EST to 16:33 EST
  • 19:13 EST to 20:38 EST

[Chart: DNS resolution time and availability of twitter.com]

The DNS failures were the result of Dyn nameservers not responding to DNS queries for more than four seconds.
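
To make that failure mode concrete, here is a minimal Python sketch of a probe that sends a single A-record query directly to a nameserver and treats the absence of a reply within four seconds as a failure. It is an illustration only, not the way the measurements in the chart were taken; the function names, the fixed query ID, and the choice of plain UDP on port 53 are assumptions made for the example.

```python
import socket
import struct

TIMEOUT_SECONDS = 4.0  # the four-second threshold mentioned in the text


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (Internet).
    return header + qname + struct.pack("!HH", 1, 1)


def nameserver_responds(server_ip: str, hostname: str,
                        timeout: float = TIMEOUT_SECONDS) -> bool:
    """Return True if server_ip answers a query for hostname within timeout seconds."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-network errors
            return False
    # Only count a reply that echoes the ID of the query we sent.
    return len(reply) >= 2 and reply[:2] == query[:2]
```

Calling `nameserver_responds` with the IP address of each authoritative nameserver for a domain, from several locations, gives a crude version of the availability signal plotted above.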

We were impacted in three ways:

  • Our domain Catchpoint.com was not reachable for a solid 30 minutes until we introduced our secondary managed DNS provider, Verisign. We also brought up a backup domain that was never on Dyn and publicized it to our customers, so they could log in to our portal and keep an eye on their online services. All of these were in standby mode prior to the incident.
  • Our nodes could not reliably talk to our globally distributed command and control systems until we switched to IP-only mode, bypassing DNS lookups. This feature had been developed, tested, and deployed to production, but it was not active because our engineering teams had planned one more enhancement. Given the situation, we deemed activating it without that enhancement to be lower risk than what we were experiencing.
  • Many of the third-party vendors our company relies on stopped working: our customer support and online help solution, CRM, office door badging system, SSO, two-factor authentication services, one of our CDNs, a file sharing solution, and the list goes on and on.
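
The idea behind an IP-only mode can be sketched roughly as follows. This is a simplified illustration, not the actual implementation: the host name, the pinned addresses (taken from the 192.0.2.0/24 documentation range), and the function names are all hypothetical.

```python
import socket
from typing import Callable, Dict, List

# Hypothetical pinned addresses for a control endpoint (documentation range).
PINNED_IPS: Dict[str, List[str]] = {
    "control.example.com": ["192.0.2.10", "192.0.2.11"],
}


def system_resolver(hostname: str) -> List[str]:
    """Resolve hostname to IPv4 addresses through the operating system resolver."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def addresses_for(
    hostname: str,
    ip_only: bool = False,
    resolver: Callable[[str], List[str]] = system_resolver,
    pinned: Dict[str, List[str]] = PINNED_IPS,
) -> List[str]:
    """Return addresses for hostname.

    With ip_only=True, DNS is skipped entirely. Otherwise DNS is tried first
    and the pinned addresses are used only if the lookup fails or is empty.
    """
    if not ip_only:
        try:
            addresses = resolver(hostname)
            if addresses:
                return addresses
        except OSError:  # socket.gaierror is a subclass of OSError
            pass
    return list(pinned.get(hostname, []))
```

The trade-off is that pinned addresses go stale when infrastructure moves, so a list like this has to be distributed and refreshed by some channel that does not itself depend on DNS.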

This blog post is not about finger-pointing; the folks at Dyn had a horrible day living through their worst nightmare. They did an amazing job of dealing with it, from notifications to extinguishing the fire. This is about how to deal with a worst-case outage, as a company and as an industry.

As with every outage, it’s important to take the time to reflect on what took place and how this can be avoided in the future.

Here are some of my takeaways from Friday, and the must-have solutions:

  • DNS is still one of the weakest links in our Internet infrastructure and digital economy. We have to keep learning and sharing that knowledge with each other. Here are several articles we have written on DNS.
  • A single DNS provider is not an option anymore for anyone. No company, small or large, can rely on a single DNS provider.
  • DNS vendors should create knowledge base articles about how to introduce secondary DNS providers, and they must be easy to find and follow.
  • DNS vendors need to make the setup of auto-transfer easier to find. Having to open a ticket in the middle of a crisis to find out the IP of the xtransfer name servers is simply not a viable option.
  • DNS vendors should not set high TTLs (two days) on the authoritative nameserver records they return in DNS responses, and it should be easy to lower or change the TTL. While this is great for bypassing record changes at the TLDs, making the nameservers authoritative for two days becomes a headache when trying to switch to or migrate from a backup solution.
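
As a rough illustration of the single-provider point, the sketch below takes the nameserver host names of a zone and checks whether they span at least two providers. The grouping rule (the last two labels of the host name) is a deliberately naive assumption that breaks for suffixes such as co.uk, and the nameserver names in the usage note are hypothetical.

```python
from typing import Iterable, Set

# The "two days" TTL from the text, expressed in seconds (172800).
TWO_DAYS_SECONDS = 2 * 24 * 60 * 60


def provider_of(nameserver: str) -> str:
    """Naive provider key: the last two labels of the nameserver host name."""
    labels = nameserver.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def providers(nameservers: Iterable[str]) -> Set[str]:
    """Return the distinct provider keys behind a set of nameservers."""
    return {provider_of(ns) for ns in nameservers}


def has_redundant_dns(nameservers: Iterable[str]) -> bool:
    """True when the nameservers belong to at least two different providers."""
    return len(providers(nameservers)) >= 2
```

For example, `has_redundant_dns(["ns1.dns-a.example", "ns2.dns-a.example"])` is false, while adding `"ns1.dns-b.example"` to the list makes it true. The TTL constant is a reminder of why the switch is slow: a resolver that cached the old nameserver set just before the change may keep using it for up to 172,800 seconds.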


Introducing another DNS vendor would not have been 100% effective until you went into the Dyn configuration and added that other solution to the mix.

Some takeaways from a monitoring standpoint:

I had people tell me, “But Mehdi, I am not seeing a problem in my RUM.” When your site isn’t reachable, RUM won’t tell you anything because there is no user activity to show. This is why your monitoring strategy must include both synthetic monitoring and RUM.

  • DNS monitoring is critical to understand the “why.”
  • DNS performance impacts web performance.
  • The impact was so widespread that some sites that didn’t rely on Dyn still suffered outages or a bad user experience, because they used third parties that did rely on Dyn.
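
The indirect exposure in that last point can be checked with a simple inventory. The sketch below is hypothetical throughout: it assumes you already maintain a map from each site to the third-party hosts it loads, and a map from each host to the DNS providers serving it, and it reports the hosts served solely by the affected provider.

```python
from typing import Dict, List, Set


def exposed_dependencies(
    site_dependencies: Dict[str, List[str]],
    dns_providers: Dict[str, Set[str]],
    affected: str,
) -> Dict[str, List[str]]:
    """Map each site to its third-party hosts that rely only on the affected provider."""
    exposed: Dict[str, List[str]] = {}
    for site, hosts in site_dependencies.items():
        # A host is at risk when the affected provider is its only DNS provider.
        at_risk = [h for h in hosts if dns_providers.get(h, set()) == {affected}]
        if at_risk:
            exposed[site] = at_risk
    return exposed
```

A site can have fully redundant DNS of its own and still appear in this result, which is the situation described above.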

We interact with many things on a daily basis (cars, cell phones, planes, hair dryers) that have some sort of certification. I urge whoever is responsible to consider the following:

  • A ban on any Internet-connected device that does not force a change of the default credentials when it is first started. There shouldn’t be Admin/Admin for anything, including cameras, refrigerators, access points, routers, etc.
  • A ban on accessing such devices from anywhere on the Internet. There should be some limitation: access either through the provider interface or from the local network.
  • Consumers should also pressure the industry by not buying products that aren’t safe. Maybe we need an “Internet Safety Rating” from a governmental agency or worldwide organization.
  • A must-have feature on every home and SMB router and access point is the ability to detect abnormal traffic or activity and turn it off or slow it down; sending thousands of DNS requests in a minute is not normal. We should learn from Microsoft and what they did with Windows XP to limit an infected host.
  • Local ISPs must have capabilities to detect and stop rogue traffic.
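
As a sketch of what detecting abnormal DNS activity on a router could look like, the class below keeps a sliding one-minute window of query timestamps per client and flags a client that exceeds a limit. The class name and the default of 1,000 queries per minute are assumptions for illustration, not a recommended threshold, and timestamps are assumed to arrive in non-decreasing order for each client.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class DnsRateMonitor:
    """Flag clients that send more than max_queries DNS queries per window."""

    def __init__(self, max_queries: int = 1000, window_seconds: float = 60.0) -> None:
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, client: str, timestamp: float) -> bool:
        """Record one query; return True if the client is now over the limit."""
        window = self._seen[client]
        window.append(timestamp)
        # Drop timestamps that have aged out of the sliding window.
        while window and timestamp - window[0] >= self.window_seconds:
            window.popleft()
        return len(window) > self.max_queries
```

A device that trips the limit could then be throttled or cut off, which is the "turn it off or slow it down" behavior described above.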

The state of cybersecurity is dire. I hope this incident serves as a huge wake-up call for everyone. What happened Friday was a Code Blue event; we rely on the Internet for practically everything in society today, and it’s our job to do everything we can to protect it.

Thank you to Dyn for the prompt responses to our support tickets, to Verisign for answering last-minute questions, to our customers, who were very patient and understanding, to our entire support organization, and to some special friends in major companies who offered a helping hand by providing some amazing advice around DNS.

Mehdi – Catchpoint CEO and Co-Founder

To learn more about how you can handle a major outage like this in the future, join our upcoming Ask Me Anything: OUTAGE! with VictorOps, Target, and Release Engineering Approaches.

The post DNS Outage Was Doomsday for the Internet appeared first on Catchpoint's Blog.


More Stories By Mehdi Daoudi

Catchpoint radically transforms the way businesses manage, monitor, and test the performance of online applications. Truly understand and improve user experience with clear visibility into complex, distributed online systems.

Founded in 2008 by four DoubleClick / Google executives with a passion for speed, reliability and overall better online experiences, Catchpoint has now become the most innovative provider of web performance testing and monitoring solutions. We are a team with expertise in designing, building, operating, scaling and monitoring highly transactional Internet services used by thousands of companies and impacting the experience of millions of users. Catchpoint is funded by top-tier venture capital firm, Battery Ventures, which has invested in category leaders such as Akamai, Omniture (Adobe Systems), Optimizely, Tealium, BazaarVoice, Marketo and many more.
