OpenStack network mystery: How 2 bytes cost me hours of trouble

Once upon a time, I set up an OpenStack cluster and experienced some strange connectivity problems with all my OpenStack instances. It was the perfect opportunity to learn more about OpenStack, take a headlong deep dive into Neutron, and update my network troubleshooting skills.

The preliminaries

I set up my OpenStack cluster on 6 Dell OptiPlex 7040 machines, 1 router, 1 managed switch, and 1 unmanaged switch. True, this is not the usual enterprise hardware you would want for a production cluster, but I tend to take things literally, so I set up OpenStack on commodity hardware. In any event, my goal was to set up a portable OpenStack cluster for demo purposes, not for production use cases.

OpenStack Cluster

Since my cluster nodes only have one NIC and I wanted to have a multi-NIC setup, I considered my options and settled on a USB 3.0 to Gigabit Ethernet adapter as a second NIC. The adapter supports 802.1Q and checksum offloading, has drivers for Linux kernel 4.x/3.x/2.6.x, and has received positive reviews for running on Linux. Once everything was set up and the wiring was complete, I started the installation of Mirantis OpenStack 9.1 from a USB stick on my designated Fuel master node.

Hey, ho, let’s go!

My OpenStack cluster has an administration network on the onboard NIC that is also used for PXE. The second NIC is used for private, storage, and public networking via VLANs. Following the successful OpenStack deployment with Fuel, I started an instance on the private network using the CirrOS image to confirm that everything was working. I assigned a floating IP and was able to connect via SSH. I checked the network configuration and then decided to view the running processes using ps faux. After executing the command, the terminal showed some processes, but then the connection froze and became unresponsive. I killed the SSH session on my machine rather than wait for a timeout.

Go-Go Gadget troubleshooting

Suspecting that a series of unfortunate events had simply occurred, I tried again. Unfortunately, I got the same result: SSH, ps faux, freeze… Damn! I was able to ping the instance, connect to it, interact with it, but it seemed that the large amount of data I was attempting to transfer was breaking the connection. I started to investigate by pinging the instance again using different packet sizes. I received the following result:

Ping results using different packet sizes
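A probe like this can also be scripted. The following is a minimal sketch, not the procedure used in this post: it assumes a Linux iputils ping (the -c, -W, and -s options) and assumes that every payload size below the breaking point is answered. The function names ping_once and largest_working_payload are made up for illustration.

```python
import subprocess


def ping_once(host, payload_size, timeout_s=2):
    """Send one ICMP echo request with the given payload size.

    Assumes Linux iputils ping: -c (count), -W (reply timeout in seconds),
    -s (payload size in bytes). Returns True if a reply arrived.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), "-s", str(payload_size), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def largest_working_payload(replies, low=0, high=1472):
    """Binary-search the largest payload size in [low, high] that is answered.

    `replies` is any callable taking a size and returning True or False.
    Assumes that if a size fails, every larger size fails too.
    Returns None if even the smallest size gets no reply.
    """
    if not replies(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if replies(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

Against the floating IP from this post, the call would be largest_working_payload(lambda size: ping_once("10.0.0.131", size)). Taking the reply check as a callable keeps the search logic testable without touching the network.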

Interestingly, pings with large payloads weren’t answered. It was time to get to the bottom of this issue using some Network 101 analysis:

  • A standard Ethernet frame is at most 1514 bytes long (not counting the 4-byte frame check sequence)
  • The Ethernet header takes up 14 bytes, which leaves 1500 bytes for payload
  • The IP header is 20 bytes long (without options)
  • The ICMP header is 8 bytes
  • That should leave exactly 1472 bytes (1500 – 20 – 8 = 1472) for ICMP payload

Anatomy of an Ethernet frame
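The arithmetic in the list above can be written down as a short sketch. The constant and function names are illustrative, and the IP header is assumed to carry no options.

```python
ETHERNET_HEADER = 14  # destination MAC + source MAC + EtherType
IP_HEADER = 20        # IPv4 header without options (assumption)
ICMP_HEADER = 8       # ICMP echo request/reply header


def max_icmp_payload(max_frame=1514):
    """Largest ICMP payload that fits into one untagged Ethernet frame."""
    mtu = max_frame - ETHERNET_HEADER  # bytes left for the IP packet
    return mtu - IP_HEADER - ICMP_HEADER


print(max_icmp_payload())
```

With the default frame size of 1514 bytes this yields the 1472 bytes derived above.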

The odd thing is, a payload of 1470 bytes works while a payload of 1472 bytes doesn’t work. This is only a 2-byte difference, which makes no sense. There is clearly a problem with large packets, but how on earth can it be off by 2 bytes? 2 bytes is essentially nothing; a VLAN header has 4 bytes and the link layer takes care of VLANs, so this doesn’t affect the payload size. Maybe IP options are the culprit? Nope, the IP header must be a multiple of 4 bytes. 2 bytes is simply inexplicable from a network protocol point of view. It’s time to send in tcpdump and Wireshark to the rescue.
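To see why IP options cannot explain a 2-byte offset: the IPv4 header length field (IHL) counts 32-bit words, with a minimum of 5 and a maximum of 15. A small sketch (the function name is illustrative) enumerates every possible header length:

```python
def ipv4_header_lengths():
    """All valid IPv4 header lengths in bytes.

    IHL is a 4-bit field counting 32-bit words; 5 is the minimum (no options).
    """
    return [ihl * 4 for ihl in range(5, 16)]


lengths = ipv4_header_lengths()
# Extra bytes that options can add on top of the plain 20-byte header.
extra_option_bytes = [n - 20 for n in lengths]
print(extra_option_bytes)
```

Every entry is a multiple of 4, so options can never shift the payload budget by exactly 2 bytes.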

Bring in the big guns

I start up Wireshark on my laptop and again ping the floating IP address of the OpenStack instance. No additional information; the last ping with a payload of 1472 bytes doesn’t receive a response.

Ping analysis with Wireshark

The next step is to run tcpdump on the OpenStack controller node that also runs Neutron, the OpenStack networking component in my setup. Neutron takes care of capturing, NATing, and forwarding the ICMP message to the instance. To understand OpenStack networking in detail, please check out the official OpenStack Networking Guide. In short, OpenStack networking is a lot like Venice: there are masquerades and bridges all over the place!

Joking aside, the tcpdump on the controller reveals no further information.

Tcpdump of ping on Neutron node

However, it does show how many hops it takes to ping an OpenStack instance. The ICMP packet traverses the physical network from my laptop (10.0.0.100) to the floating IP address of the instance (10.0.0.131), several virtual interfaces and bridges, until it reaches the OpenStack instance on its private IP address (192.168.10.101), which in turn sends the response back to my laptop along the same route.

Tcpdump of ping with payload sizes 1470 and 1472 (+8 bytes ICMP header, yields lengths of 1478 and 1480)

Once again, the ping with 1470 bytes of payload receives a response while the 1472-byte payload remains unanswered. However, we get some additional information. The packet vanishes before the destination address is changed to the private IP address of the instance and the packet is forwarded to the compute node that runs the instance. I checked the Neutron log files, Nova log files, and syslog, but couldn’t find anything. Eventually, I found something interesting in the kern.log file.

Kernel log file showing dropped packets

Remember the 2-byte difference we identified earlier? Here they are, causing the kernel to drop the packets silently because they’re too long. Now we have evidence that something is messing up the network packets. However, two questions remain: why and how? Analyzing the tcpdump with Wireshark sheds more light on the problem. Can you spot it in the image below?

Wireshark view of previous tcpdump

First, the destination MAC address is empty because we used tcpdump on “any” device. In that mode, tcpdump doesn’t capture the link-layer header correctly. Instead, it supplies a fake header. Secondly, the Ethernet frame No. 1 has a length of 1518 bytes, which is odd given the 1470 bytes of ICMP payload, 8 bytes of ICMP header, 20 bytes of IP header, and 18 bytes of Ethernet + VLAN header (1470 + 8 + 20 + 18 = 1516 bytes).
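The expected frame length can be checked with a short sketch. The function name is illustrative, and the header sizes are the ones used in the calculation above (14 bytes of Ethernet header plus a 4-byte VLAN tag).

```python
def expected_frame_length(icmp_payload, vlan_tagged=True):
    """Frame length for an ICMP echo, excluding the frame check sequence."""
    icmp_header = 8
    ip_header = 20        # IPv4 header without options (assumption)
    ethernet_header = 14
    vlan_tag = 4 if vlan_tagged else 0
    return icmp_payload + icmp_header + ip_header + ethernet_header + vlan_tag


observed = 1518                          # length reported for frame No. 1
expected = expected_frame_length(1470)   # 1470 + 8 + 20 + 18
unexplained = observed - expected
print(expected, unexplained)
```

The difference between the observed and the expected length is exactly 2 bytes.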

Have a look at the line VSS-Monitoring ethernet trailer, Source Port: 9599 in the Wireshark screenshot above, just below the yellow highlighted ICMP line.

Trailing bytes

Finally, we’ve found the 2 additional bytes! But where do they come from? Google tells us that this can be caused by the padding of packets at the network-driver level. I reconsidered my setup and identified the USB network adapter as the weakest link. To be honest, I suspected this might be the issue from the beginning, but I never imagined it would catch up with me in this way.

I downloaded and built the latest version of the driver and replaced the kernel module. Lo and behold, all my networking problems were gone! Pings of arbitrary payload sizes, SSH sessions, and file transfers all suddenly worked. In the end, my networking issues were caused by an issue with the driver that ships by default with the Linux kernel.

Lessons learned

I learned a lot while troubleshooting this issue. Here are my insights:

  • First and foremost, use recommended/certified hardware for OpenStack and follow the recommendations of your distribution of choice. I spent a lot of time chasing down a bug that would have been avoided if I’d used appropriate hardware.
  • Knowing computer networks and understanding OpenStack Neutron is mandatory for troubleshooting. Get familiar with the technologies you’re using.
  • OpenStack works. There are huge setups out there working in production. When something doesn’t work as expected, the issue is likely not with OpenStack or its services.
  • When troubleshooting, trust your experience and gut feelings!

How Dynatrace could have saved me time

If I’d used Dynatrace to troubleshoot this issue, I’d have seen the impact on network quality on all of my monitored VMs. Additionally, I could have enabled Dynatrace log analytics to assist with troubleshooting from a log-analysis perspective. When your gut tells you that there may be troubles with MTU coming your way, you can proactively add the /var/log/kern.log file to Dynatrace log analytics and create a pattern-recognition rule (for example, over-mtu packet). With this approach, I could have received a notification each time this pattern appeared in the log files and I would have instantly known where to look for errors in the configuration… or in the network drivers.
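Independent of any monitoring product, the same idea can be sketched in a few lines of Python. This is not how Dynatrace log analytics works internally. It is a plain substring search for the pattern named above, and the sample lines are made-up placeholders, not real kernel output.

```python
def find_over_mtu_lines(lines, pattern="over-mtu packet"):
    """Return every log line that contains the given pattern."""
    return [line.rstrip("\n") for line in lines if pattern in line]


# Hypothetical sample lines, for illustration only (not real kernel messages).
sample = [
    "<timestamp> <host> kernel: <device>: link is up\n",
    "<timestamp> <host> kernel: <device>: over-mtu packet (illustrative line)\n",
    "<timestamp> <host> kernel: <device>: over-mtu packet (illustrative line)\n",
]

hits = find_over_mtu_lines(sample)
print(len(hits), "matching line(s)")
```

To scan the real log, pass an open file object instead of the sample list, for example with open("/var/log/kern.log") as f: find_over_mtu_lines(f). Reading that file usually requires elevated privileges.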

The post OpenStack network mystery: How 2 bytes cost me hours of trouble appeared first on Dynatrace blog – monitoring redefined.

