OpenStack network mystery: How 2 bytes cost me hours of trouble

Once upon a time, I set up an OpenStack cluster and experienced some strange connectivity problems with all my OpenStack instances. It was the perfect opportunity to learn more about OpenStack, dive headlong into Neutron, and update my network troubleshooting skills.

The preliminaries

I set up my OpenStack cluster on 6 Dell OptiPlex 7040 machines, 1 router, 1 managed switch, and 1 unmanaged switch. True, this is not the usual enterprise hardware you would want for a production cluster, but I tend to take things literally, so I set up OpenStack on commodity hardware. In any event, my goal was to set up a portable OpenStack cluster for demo purposes, not for production use cases.

OpenStack Cluster

Since my cluster nodes only have one NIC and I wanted to have a multi-NIC setup, I considered my options and settled on a USB 3.0 to Gigabit Ethernet adapter as a second NIC. The adapter supports 802.1Q and checksum offloading, has drivers for Linux kernel 4.x/3.x/2.6.x, and has received positive reviews for running on Linux. Once everything was set up and the wiring was complete, I started the installation of Mirantis OpenStack 9.1 from a USB stick on my designated Fuel master node.

Hey, ho, let’s go!

My OpenStack cluster has an administration network that is also used for PXE on the onboard NIC. The second NIC is used for private, storage, and public networking via VLANs. Following the successful OpenStack deployment with Fuel, I started an instance on the private network using the CirrOS image to confirm that everything was working. I assigned a floating IP and was able to connect via SSH. I checked the network configuration and then decided to view the running processes using ps faux. After executing the command, the terminal showed some processes, but then the connection froze and became unresponsive. I killed the SSH session on my machine rather than wait for a timeout.

Go-Go Gadget troubleshooting

Suspecting that a series of unfortunate events had simply occurred, I tried again. Unfortunately, I got the same result: SSH, ps faux, freeze… Damn! I was able to ping the instance, connect to it, interact with it, but it seemed that the large amount of data I was attempting to transfer was breaking the connection. I started to investigate by pinging the instance again using different packet sizes. I received the following result:

Ping results using different packet sizes
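The probing described above can be automated. The following is a minimal sketch, not the procedure used in this post: it assumes a Linux iputils `ping` (the `-c`, `-s`, and `-M do` options), and the helper names `ping_command`, `ping_probe`, and `largest_answered_payload` are made up for illustration. It also assumes that once a payload size fails, every larger size fails too.

```python
import subprocess
from typing import Callable, List


def ping_command(host: str, payload_size: int) -> List[str]:
    """Build a Linux iputils ping command line for a single probe.

    -c 1   send one echo request
    -M do  set the Don't Fragment bit (assumption: we want no fragmentation)
    -s N   use N bytes of ICMP payload
    """
    return ["ping", "-c", "1", "-M", "do", "-s", str(payload_size), host]


def ping_probe(host: str) -> Callable[[int], bool]:
    """Return a probe that reports whether a ping of the given size was answered."""
    def probe(payload_size: int) -> bool:
        result = subprocess.run(
            ping_command(host, payload_size),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0  # ping exits with 0 if a reply arrived
    return probe


def largest_answered_payload(probe: Callable[[int], bool],
                             low: int = 0, high: int = 1472) -> int:
    """Binary-search the largest payload size for which probe(size) is True.

    Assumes probe(low) succeeds and that failures are monotonic in size.
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Calling `largest_answered_payload(ping_probe("10.0.0.131"))` would run the search against the floating IP used later in this post. Passing a fake probe instead makes the search logic testable without a network.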

Interestingly, pings with large payloads weren’t answered. It was time to get to the bottom of this issue using some Network 101 analysis:

  • A full-size Ethernet frame is 1514 bytes long (not counting the 4-byte frame check sequence)
  • The Ethernet header takes up 14 bytes, which leaves 1500 bytes for payload
  • The IP header is 20 bytes long
  • The ICMP header is 8 bytes
  • That should leave exactly 1472 bytes (1500 – 20 – 8 = 1472) for ICMP payload

Anatomy of an Ethernet frame
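The arithmetic in the list above can be written down as a small sketch. The constant and function names are illustrative, and the numbers assume an IPv4 header without options and an untagged frame.

```python
ETHERNET_HEADER = 14  # destination MAC + source MAC + EtherType
IPV4_HEADER = 20      # IPv4 header without options
ICMP_HEADER = 8       # ICMP echo request/reply header


def max_icmp_payload(mtu: int = 1500) -> int:
    """Largest ICMP echo payload that fits into one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def frame_length(icmp_payload: int) -> int:
    """Length of the untagged Ethernet frame (without FCS) carrying the ping."""
    return ETHERNET_HEADER + IPV4_HEADER + ICMP_HEADER + icmp_payload


print(max_icmp_payload())  # 1472
print(frame_length(1472))  # 1514
```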

The odd thing is, a payload of 1470 bytes works while a payload of 1472 bytes doesn’t work. This is only a 2-byte difference, which makes no sense. There is clearly a problem with large packets, but how on earth can it be off by 2 bytes? 2 bytes is essentially nothing; a VLAN header has 4 bytes and the link layer takes care of VLANs, so this doesn’t affect the payload size. Maybe IP options are the culprit? Nope, the IP header must be a multiple of 4 bytes. 2 bytes is simply inexplicable from a network protocol point of view. It’s time to send in tcpdump and Wireshark to the rescue.
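To make the "2 bytes is inexplicable" argument concrete, here is a small sketch that enumerates the header overheads considered above. It relies on the fact that the IPv4 header length is encoded as a count of 4-byte words (5 to 15), and the names are illustrative only.

```python
VLAN_TAG = 4         # 802.1Q tag, added and removed at the link layer
MIN_IPV4_HEADER = 20

# The IPv4 header length field counts 4-byte words, from 5 (no options) to 15.
ipv4_header_lengths = [words * 4 for words in range(5, 16)]

# Extra bytes that IP options could add on top of the minimum header.
option_overheads = [length - MIN_IPV4_HEADER for length in ipv4_header_lengths]


def explains_two_bytes() -> bool:
    """Check whether any of the considered overheads is exactly 2 bytes."""
    candidates = set(option_overheads) | {VLAN_TAG}
    return 2 in candidates


print(option_overheads)      # [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40]
print(explains_two_bytes())  # False
```

Every candidate is a multiple of 4, so none of these protocol-level mechanisms can account for a 2-byte shortfall.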

Bring in the big guns

I start up Wireshark on my laptop and again ping the floating IP address of the OpenStack instance. No additional information; the last ping with a payload of 1472 bytes doesn’t receive a response.

Ping analysis with Wireshark

The next step is to run tcpdump on the OpenStack controller node that also runs Neutron, the OpenStack networking component in my setup. Neutron takes care of capturing, NATing, and forwarding the ICMP message to the instance. To understand OpenStack networking in detail, please check out the official OpenStack Networking Guide. In short, OpenStack networking is a lot like Venice: there are masquerades and bridges all over the place!

Joking aside, the tcpdump on the controller reveals no further information.

Tcpdump of ping on Neutron node

However, it does show how many hops it takes to ping an OpenStack instance. The ICMP packet traverses the physical network from my laptop (10.0.0.100) to the floating IP address of the instance (10.0.0.131), several virtual interfaces and bridges, until it reaches the OpenStack instance on its private IP address (192.168.10.101), which in turn sends the response back to my laptop along the same route.

Tcpdump of ping with payload sizes 1470 and 1472 (+8 bytes ICMP header, yields lengths of 1478 and 1480)

Once again, the ping with 1470 bytes of payload receives a response while the 1472-byte payload remains unanswered. However, we get some additional information. The packet vanishes before the destination address is changed to the private IP address of the instance and the packet is forwarded to the compute node that runs the instance. I checked the Neutron log files, Nova log files, and syslog, but couldn’t find anything. Eventually, I found something interesting in the kern.log file.

Kernel log file showing dropped packets

Remember the 2-byte difference we identified earlier? Here they are, causing the kernel to drop the packets silently because they’re too long. Now we have evidence that something is messing up the network packets. However, two questions remain: why and how? Analyzing the tcpdump with Wireshark sheds more light on the problem. Can you spot it in the image below?

Wireshark view of previous tcpdump

First, the destination MAC address is empty because we used tcpdump on “any” device. In that mode, tcpdump doesn’t capture the link-layer header correctly. Instead, it supplies a fake header. Second, Ethernet frame No. 1 has a length of 1518 bytes, which is odd given the 1470 bytes of ICMP payload, 8 bytes of ICMP header, 20 bytes of IP header, and 18 bytes of Ethernet + VLAN header (1470 + 8 + 20 + 18 = 1516 bytes).
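The length check above can be sketched in a few lines. The function names are hypothetical, and the header sizes are the ones quoted in the paragraph (18 bytes for Ethernet plus VLAN tag).

```python
ICMP_HEADER = 8
IPV4_HEADER = 20
ETHERNET_PLUS_VLAN_HEADER = 18  # 14-byte Ethernet header + 4-byte 802.1Q tag


def expected_tagged_frame_length(icmp_payload: int) -> int:
    """Frame length we expect for a VLAN-tagged ping with the given payload."""
    return icmp_payload + ICMP_HEADER + IPV4_HEADER + ETHERNET_PLUS_VLAN_HEADER


def surplus_bytes(observed_frame_length: int, icmp_payload: int) -> int:
    """Bytes in the captured frame that the protocol headers cannot account for."""
    return observed_frame_length - expected_tagged_frame_length(icmp_payload)


print(expected_tagged_frame_length(1470))  # 1516
print(surplus_bytes(1518, 1470))           # 2
```

A non-zero surplus means that something appended data after the IP packet, which is where to look next in the capture.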

Have a look at the line VSS-Monitoring ethernet trailer, Source Port: 9599 in the Wireshark screenshot above, just below the yellow highlighted ICMP line.

Trailing bytes

Finally, we’ve found the 2 additional bytes! But where do they come from? Google tells us that this can be caused by the padding of packets at the network-driver level. I reconsidered my setup and identified the USB network adapter as the weakest link. To be honest, I suspected this might be the issue from the beginning, but I never imagined it would catch up with me in this way.

I downloaded and built the latest version of the driver and replaced the kernel module. Lo and behold, all my networking problems were gone! Pings of arbitrary payload sizes, SSH sessions, and file transfers all suddenly worked. In the end, my networking issues were caused by an issue with the driver that ships by default with the Linux kernel.

Lessons learned

I learned a lot while troubleshooting this issue. Here are my insights:

  • First and foremost, use recommended/certified hardware for OpenStack and follow the recommendations of your distribution of choice. I spent a lot of time chasing down a bug that would have been avoided if I’d used appropriate hardware.
  • Knowing computer networks and understanding OpenStack Neutron is mandatory for troubleshooting. Get familiar with the technologies you’re using.
  • OpenStack works. There are huge setups out there working in production. When something doesn’t work as expected, the issue is likely not with OpenStack or its services.
  • When troubleshooting, trust your experience and gut feelings!

How Dynatrace could have saved me time

If I’d used Dynatrace to troubleshoot this issue I’d have seen the impact on network quality on all of my monitored VMs. Additionally, I could have enabled Dynatrace log analytics to assist with troubleshooting from a log-analysis perspective. When your gut tells you that there may be troubles with MTU coming your way, you can proactively add the /var/log/kern.log file to Dynatrace log analytics and create a pattern-recognition rule (for example, over-mtu packet). With this approach, I could have received a notification each time this pattern appeared in the log files and I would have instantly known where to look for errors in the configuration… or in the network drivers.
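The pattern-matching idea can also be illustrated without any monitoring product. The sketch below is plain Python, not Dynatrace's API. It only looks for the substring mentioned above, and the sample lines are made up for the example; they are not real kernel output.

```python
from typing import Iterable, List

PATTERN = "over-mtu packet"  # the pattern named in the text above


def find_over_mtu_lines(lines: Iterable[str], pattern: str = PATTERN) -> List[str]:
    """Return every log line that contains the pattern, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if pattern in line]


# Made-up sample lines, for illustration only.
sample_log = [
    "example line A: something about an over-mtu packet\n",
    "example line B: unrelated message\n",
]

for hit in find_over_mtu_lines(sample_log):
    print(hit)
```

To scan the real file, the same function could be fed an open file handle, for example `find_over_mtu_lines(open("/var/log/kern.log"))`, assuming the process has permission to read it.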

The post OpenStack network mystery: How 2 bytes cost me hours of trouble appeared first on Dynatrace blog – monitoring redefined.


More Stories By Dynatrace Blog

Building a revolutionary approach to software performance monitoring takes an extraordinary team. With decades of combined experience and an impressive history of disruptive innovation, that’s exactly what ruxit has.

Get to know ruxit, and get to know the future of data analytics.
