|By Mehdi Daoudi||
|October 23, 2016 11:27 PM EDT|
What was supposed to be a quiet Friday suddenly turned into a real “Black Friday” for us (as well as most of the Internet) when Dyn suffered a major DDOS attack. From an internet disruption’s perspective, the widespread damage the outage caused made it the worst I have ever experienced.
At the core of it all, the managed DNS provider Dyn was targeted in a DDOS attack that impacted thousands of web properties, services, SaaS providers, and more.
The chart below shows the DNS resolution time and availability of twitter.com from around the world. There were three clear waves of outages:
- 7:10 EST to 9:10 EST
- 11:52 EST to 16:33 EST
- 19:13 EST to 20:38 EST
http://assetsblogfly2.catchpoint.com/wp-content/uploads/2016/10/DNS-Twit... 300w, http://assetsblogfly2.catchpoint.com/wp-content/uploads/2016/10/DNS-Twit... 768w, http://assetsblogfly2.catchpoint.com/wp-content/uploads/2016/10/DNS-Twit... 624w, http://assetsblogfly2.catchpoint.com/wp-content/uploads/2016/10/DNS-Twit... 1453w" sizes="(max-width: 625px) 100vw, 625px" />
The DNS failures were the result of Dyn nameservers not responding to DNS queries for more than four seconds.
We were impacted in three ways:
- Our domain Catchpoint.com was not reachable for a solid 30 minutes until we introduced our secondary managed DNS provider Verisign. We also brought up and publicized to our customers a backup domain that was never on Dyn, so our customers could login to our portal and keep an eye on their online services. All of these were in standby mode prior to the incident.
- Our nodes could not reliably talk to our globally distributed command and control systems until we switched to IP only mode, bypassing DNS lookups. This was a feature we had developed, tested, and in production, but was not active as our engineering teams planned one more enhancement. Due to the nature of the situation, we deemed the enhancement to be lower risk than what we were experiencing.
- Many of our own third party vendors that our company relies on stopped working- Customer Support and Online Help solution, CRM, office door badging system, SSO, 2 Factor Authentication services, one of the CDNs, a file sharing solution, and the list goes on and on.
This blog post is not about finger pointing; the folks at Dyn had a horrible day putting up with their worst nightmare. They did an amazing job of dealing with it, from notifications to extinguishing the fire. This is about how to deal with the worst case outage, as a company and an industry.
As with every outage, it’s important to take the time to reflect on what took place and how this can be avoided in the future.
Here are some of my takeaways from Friday, and the must-have solutions:
- DNS is still one of the weakest links in our Internet infrastructure and digital economy. We have to keep learning and sharing that knowledge with each other. Here are several articles we have written on DNS.
- A single DNS provider is not an option anymore for anyone. No company, small or large, can rely on a single DNS provider.
- DNS vendors should create knowledge base articles about how to introduce secondary DNS providers, and they must be easy to find and follow.
- DNS vendors need to make the setup of auto – transfer easier to find. Having to open a ticket in a middle of a crisis to find out the IP of the xtransfer name servers is simply not a viable option.
- DNS Vendors should not set high TTLs (two days) on the authoritative nameserver records they pass on the DNS queries, and it should be easy to drop or change TTL. While this is great to bypass changing records on the TLDs, making the nameservers authoritative for two days becomes a headache when trying to switch to or migrate from a back-up solution.
http://assetsblogfly2.catchpoint.com/wp-content/uploads/2016/10/image001... 768w, http://assetsblogfly2.catchpoint.com/wp-content/uploads/2016/10/image001... 624w, http://assetsblogfly2.catchpoint.com/wp-content/uploads/2016/10/image001... 885w" sizes="(max-width: 300px) 100vw, 300px" />
Introducing another DNS vendor wouldn’t have achieved 100% of the result until you go into the Dyn configuration and add that other solution in the mix:
- http://assetsblogfly2.catchpoint.com/wp-content/uploads/2016/10/image002... 768w, http://assetsblogfly2.catchpoint.com/wp-content/uploads/2016/10/image002... 1024w, http://assetsblogfly2.catchpoint.com/wp-content/uploads/2016/10/image002... 624w, http://assetsblogfly2.catchpoint.com/wp-content/uploads/2016/10/image002... 1429w" sizes="(max-width: 300px) 100vw, 300px" />The community must work together to come up with commercial or open source solutions to make DNS configurations compatible between vendors (this is for complex DNS setups like failover, geo load balancing, etc.). This is a no longer a nice-to-have, but a must-have.
- There needs to be a way to push registrar configurations faster. We need an emergency reload button at the registrar levels, but also at the Root and TLD levels. And this means someway to tell DNS resolvers to purge their cache. Waiting for two days is not going to work after this catastrophic event, just like having multiple DNS vendors active does not scale financially for most companies.
- Lastly, it is time for The Internet Engineering Task Force to take a very close look at the DNS standards and figure out how to make this key protocol more redundant and flexible to deal with the challenges we are facing today.
Some takeaways from a monitoring standpoint:
I had people tell me, “But Mehdi, I am not seeing a problem in my RUM.” When your site isn’t reachable, RUM won’t tell you anything because there is no user activity to show. This is why your monitoring strategy must include synthetic and RUM.
- DNS monitoring is critical to understand the “why.”
- DNS performance impacts web performance.
- The impact was so incredible, some sites that didn’t rely on Dyn still suffered outages or bad user experience. This is because they used third parties that did rely on Dyn.
We interact with many things on a daily basis (cars, cell phones, planes, hair dryers) that have some sort of certification. I urge whoever is responsible to consider the following:
- A ban on any Internet-connected device that does not force the change of default credential upon starting it. There shouldn’t Admin/Admin for anything including cameras, refrigerators, access points, routers, etc.
- A ban on accessing of such devices from any place on the Internet. There should be some limitation, either access through the provider interface or from local network.
- Consumers should also pressure the industry by not buying the products that aren’t safe. Maybe we need an “Internet Safety Rating” from a governmental agency or worldwide organization.
- A must-have feature on every home and SMB router, and access point is the ability to detect abnormal traffic/activity and turn it off or slow it down; sending thousands of DNS requests in a minute is not normal. We should learn from Microsoft and what they did with Windows XP to limit an infected host.
- Local ISPs must have capabilities to detect and stop rogue traffic.
Cybersecurity is dire. I hope this incident serves as a huge wake-up call for everyone. What happened Friday was a Code Blue event; we rely on the Internet for practically everything in society today, and it’s our job to do everything we can to protect it.
Thank you, Dyn, for the prompt response times to the support tickets, to Verisign for last-minute questions, our customers who were very patient and understanding, our entire support organization, and some special friends in major companies who offered a helping hand by providing some amazing advice around DNS.
Mehdi – Catchpoint CEO and Co-Founder
To learn more about how you can handle a major outage like this in the future, join our upcoming Ask Me Anything: OUTAGE! with VictorOps, Target, and Release Engineering Approaches.
What if you could build a web application that could support true web-scale traffic without having to ever provision or manage a single server? Sounds magical, and it is! In his session at 20th Cloud Expo, Chris Munns, Senior Developer Advocate for Serverless Applications at Amazon Web Services, will show how to build a serverless website that scales automatically using services like AWS Lambda, Amazon API Gateway, and Amazon S3. We will review several frameworks that can help you build serverle...
Mar. 22, 2017 06:30 PM EDT Reads: 1,015
Your homes and cars can be automated and self-serviced. Why can't your storage? From simply asking questions to analyze and troubleshoot your infrastructure, to provisioning storage with snapshots, recovery and replication, your wildest sci-fi dream has come true. In his session at @DevOpsSummit at 20th Cloud Expo, Dan Florea, Director of Product Management at Tintri, will provide a ChatOps demo where you can talk to your storage and manage it from anywhere, through Slack and similar services ...
Mar. 22, 2017 06:15 PM EDT Reads: 3,850
VeriStor Systems has announced that CRN has named VeriStor to its 2017 Managed Service Provider (MSP) 500 list in the Elite 150 category. This annual list recognizes North American solution providers with cutting-edge approaches to delivering managed services. Their offerings help companies navigate the complex and ever-changing landscape of IT, improve operational efficiencies, and maximize their return on IT investments. In today’s fast-paced business environments, MSPs play an important role...
Mar. 22, 2017 05:45 PM EDT Reads: 1,718
SYS-CON Events announced today that CA Technologies has been named “Platinum Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business – from apparel to energy – is being rewritten by software. From ...
Mar. 22, 2017 04:30 PM EDT Reads: 687
SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a ...
Mar. 22, 2017 03:45 PM EDT Reads: 1,057
Keeping pace with advancements in software delivery processes and tooling is taxing even for the most proficient organizations. Point tools, platforms, open source and the increasing adoption of private and public cloud services requires strong engineering rigor - all in the face of developer demands to use the tools of choice. As Agile has settled in as a mainstream practice, now DevOps has emerged as the next wave to improve software delivery speed and output. To make DevOps work, organization...
Mar. 22, 2017 03:30 PM EDT Reads: 677
SYS-CON Events announced today that Juniper Networks (NYSE: JNPR), an industry leader in automated, scalable and secure networks, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Juniper Networks challenges the status quo with products, solutions and services that transform the economics of networking. The company co-innovates with customers and partners to deliver automated, scalable and secure network...
Mar. 22, 2017 03:15 PM EDT Reads: 436
My team embarked on building a data lake for our sales and marketing data to better understand customer journeys. This required building a hybrid data pipeline to connect our cloud CRM with the new Hadoop Data Lake. One challenge is that IT was not in a position to provide support until we proved value and marketing did not have the experience, so we embarked on the journey ourselves within the product marketing team for our line of business within Progress. In his session at @BigDataExpo, Sum...
Mar. 22, 2017 02:45 PM EDT Reads: 2,153
DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm. In his Day 3 Keynote at 20th Cloud Expo, Chris Brown, a Solutions Marketing Manager at Nutanix, will explore t...
Mar. 22, 2017 02:15 PM EDT Reads: 2,101
DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm.
Mar. 22, 2017 02:00 PM EDT Reads: 727
SYS-CON Events announced today that Ocean9will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Ocean9 provides cloud services for Backup, Disaster Recovery (DRaaS) and instant Innovation, and redefines enterprise infrastructure with its cloud native subscription offerings for mission critical SAP workloads.
Mar. 22, 2017 02:00 PM EDT Reads: 1,303
With major technology companies and startups seriously embracing Cloud strategies, now is the perfect time to attend @CloudExpo | @ThingsExpo, June 6-8, 2017, at the Javits Center in New York City, NY and October 31 - November 2, 2017, Santa Clara Convention Center, CA. Learn what is going on, contribute to the discussions, and ensure that your enterprise is on the right path to Digital Transformation.
Mar. 22, 2017 01:30 PM EDT Reads: 8,051
Have you ever noticed how some IT people seem to lead successful, rewarding, and satisfying lives and careers, while others struggle? IT author and speaker Don Crawley uncovered the five principles that successful IT people use to build satisfying lives and careers and he shares them in this fast-paced, thought-provoking webinar. You'll learn the importance of striking a balance with technical skills and people skills, challenge your pre-existing ideas about IT customer service, and gain new in...
Mar. 22, 2017 01:15 PM EDT Reads: 1,806
"When you think about the data center today, there's constant evolution, The evolution of the data center and the needs of the consumer of technology change, and they change constantly," stated Matt Kalmenson, VP of Sales, Service and Cloud Providers at Veeam Software, in this SYS-CON.tv interview at 18th Cloud Expo, held June 7-9, 2016, at the Javits Center in New York City, NY.
Mar. 22, 2017 12:45 PM EDT Reads: 7,673
@DevOpsSummit at Cloud taking place June 6-8, 2017, at Javits Center, New York City, is co-located with the 20th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time to wait for long developm...
Mar. 22, 2017 12:00 PM EDT Reads: 2,600