
A Tale of Two Load Balancers

It was the best of load balancers, it was the worst of load balancers, it was the age of happy users, it was the age of frustrated users.

I get to see a variety of interesting network problems; sometimes these are first-hand, but more frequently now these are through our partner organization. Some are old hat; TCP window constraints on high latency networks remain at the top of that list. Others represent new twists on stupid network tricks, often resulting from external manipulation of TCP parameters for managing throughput (or shaping traffic). And occasionally – as in this example – there’s a bit of both.

Many thanks to Stefan Deml, co-founder and board member at amasol AG, Dynatrace’s Platinum Partner headquartered in Munich, Germany. Stefan and his team worked diligently and expertly with their customer to uncover – and fix – the elusive root cause of an ongoing performance complaint.

Problem brief

Users in North America connect to an application hosted in Germany. The app uses the SOAP protocol to request and deliver information. Users connect through a firewall and one of two Cisco ACE 30 load balancers to the first-tier WebLogic app servers.

When users connect through LB1, performance is good. When they connect through LB2, however, performance is quite poor. While the definition of “poor performance” varied depending on the type of transaction, the customer identified a 1.5MB test transaction that helped quantify the problem quite well: fast is 10 seconds, while slow is 60 seconds – or even longer.

EUE monitoring

Dynatrace DC RUM is used to monitor this customer’s application performance and user experience, alerting the IT team to the problem and quantifying the severity of user complaints. (When users complain that response time is measured in minutes rather than seconds, it’s helpful to have a solution that validates those claims with measured transaction response times.) DC RUM automatically isolated the problem to a network-related bottleneck, while proving that the network itself – as qualified by packet loss and congestion delay – was not to blame.

Time to dig a little deeper

I’ll use Dynatrace Network Analyzer – DNA, my protocol analyzer of choice – to examine the underlying behavior and identify the root cause of the problem, taking advantage of the luxury of having traces of both good and poor performing transactions.  I’ll skip DNA’s top-down analysis (I’m assuming you don’t care to see yet another Client/Network/Server pie chart), and dive directly into annotated packet-level Bounce Diagrams to illustrate the problem.

(DNA’s Bounce Diagram is simply a graphic of a trace file; each packet is represented by an arrow color-coded according to packet size.)

First, the fast transaction instance:

Bounce Diagram illustrating a fast instance of the test transaction through LB1; total elapsed time about 10 seconds.

For the fast transaction, most of the 10-second delay is allocated to server processing; the response download of 1.5MB takes about 1.7 seconds – about 7Mbps.

Here’s the same view of the slow transaction instance:

Bounce Diagram illustrating a slow instance of the test transaction through LB2; total elapsed time about 70 seconds.

There are two distinct performance differences between the fast transaction – the baseline – and this slow transaction. First, a dramatic increase in client request time (from 175 msec. to 52 seconds!); second, a smaller but still significant increase in response download time, from 1.7 seconds to 7.7 seconds.

The MSB (most significant bottleneck)

Let’s first examine the most significant bottleneck in the slow transaction. The client SOAP request – only 3KB – takes about 52 seconds to transmit to the server, in 13 packets.

The packet trace shows the client sending very small packets, with gaps of about 5 seconds between them. Examining the ACKs from LB2, we see that the TCP receive window size is unusually small: 254 bytes.

Packet trace excerpt showing LB2 advertising a window size of 256 bytes.

Such an unusually small window advertisement is generally a reliable indicator that TCP Window Scaling is active; without the SYN/SYN/ACK handshake, a protocol analyzer doesn’t know whether scaling is active, and is therefore unable to apply a scale factor to accurately interpret the window size field.

The customer did provide another trace that included the handshake, showing that the LB response to the client’s SYN does in fact include the Window Scaling option – with a scale factor of 0.

The SYN packet from LB2; window scaling will be supported, but LB2 will not scale its receive window.

Odd? Not really; this simply means that LB2 will allow the client to scale its receive window, but doesn’t intend to scale its own. The initial (non-scaled) receive window advertised by the LB is 32768. (It’s interesting to note that given a scale factor of 7, a receive window value of 256 would equal 32768.)
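To make the scale-factor arithmetic concrete, here's a minimal sketch of how a TCP window field must be interpreted once scaling (RFC 7323) is negotiated, using the values from this trace:

```python
# The advertised TCP window is the 16-bit header field shifted left by
# the scale factor negotiated in the SYN/SYN-ACK handshake.

def effective_window(raw_window: int, scale_factor: int) -> int:
    """Return the real advertised window in bytes."""
    return raw_window << scale_factor

# LB2 negotiated a scale factor of 0, so its window field is literal:
assert effective_window(32768, 0) == 32768  # the initial advertisement
assert effective_window(254, 0) == 254      # the tiny window seen later

# Had the factor been 7, a field value of 256 would have meant the same
# 32 KB buffer -- which is likely what LB2 *intended* to advertise:
assert effective_window(256, 7) == 32768
```

This is why the trace looks like the LB applied a scale factor internally that it never declared on the wire.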

Once a few packets have been exchanged on the connection, however, LB2 abruptly reduces its receive window from 32768 to 254 – even though the client has sent only a few hundred bytes. This is clearly not a result of the TCP socket’s buffer space filling up. Instead, it’s as if LB2 suddenly shifts to a non-zero scale factor (perhaps that factor of 7 I just suggested), even though it has already established a scale factor of zero.

Pop quiz: What to do with tiny windows?

Question: what should a TCP sender do when the peer TCP receive window falls below the MSS?

Answer: The sender should wait until the receiver’s window increases to a value greater than the MSS.

In practice, this means the sender waits for the receiver to empty its buffer. Given a receiver that is slow to read data from its buffer – and therefore advertises a small window of less than the MSS – it would be silly for the sender to send tiny packets just to fill the remaining space. In fact, this undesirable behavior is called the silly window syndrome, avoided through algorithms built into TCP.

For this reason, protocol analyzers and network probes should treat the occurrence of small (<MSS) window advertisements the same as zero window events, as they have the same performance impact.

When a receiver’s window is at zero for an extended period, a sender will typically send a window probe packet attempting to “wake up” the receiver. Of course, since the window is zero, no usable payload accompanies this window probe packet. In our example, the window is not zero, but the sender behavior is similar; the LB waits five seconds, then sends a small packet with just enough data (254 bytes) to fill the advertised window. The ACK is immediate (the LB’s ACK frequency is 1), but the advertised window remains abnormally small. We can conclude that the LB believes it is advertising a full 32KB buffer, although it is telling the client something much different.
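A quick back-of-the-envelope check (not a protocol implementation) shows this behavior accounts for nearly all of the delay. Assuming the request is exactly 3,072 bytes and each small send waits out the ~5-second gap seen in the trace:

```python
import math

request_bytes = 3 * 1024  # ~3 KB SOAP request (assumed exactly 3,072 bytes)
window_bytes = 254        # LB2's advertised receive window
gap_seconds = 5           # observed delay before each small send

packets = math.ceil(request_bytes / window_bytes)
assert packets == 13      # matches the 13 packets seen in the trace

# 12 gaps between 13 sends, ignoring transmission and ACK time:
estimate_seconds = (packets - 1) * gap_seconds
print(estimate_seconds)   # ~60 seconds: same order as the observed ~52
```

The rough estimate lands in the same neighborhood as the measured request time, confirming that the tiny window – not the network – is the bottleneck.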

After about 52 seconds, the 3K request reaches LB2, after which application processing occurs normally. It’s a good thing the request size wasn’t 30K!

The NSB (next significant bottleneck)

As is quite common, there’s another tuning opportunity – the NSB. This is highlighted by DC RUM’s metric called Server Realized Bandwidth, or download rate. The fast transaction transfers 1.5MB in about 1.6 seconds (7.5Mbps), while the slow transaction takes about 8 seconds for the same payload (1.5Mbps).

Could this be receiver flow control, or a small configured TCP receive window? These would seem reasonable theories – except that we’re using the same client for the tests. A quick look at the receiver’s TCP window proves this is not the case, as it remains at 131,072 (a window field of 256 with a scale factor of 9).

DNA’s Timeplot can graph a sender’s TCP Payload in Transit; comparing this with the receiver’s advertised TCP window can quickly prove – or disprove – a TCP window constraint theory.

Time plot showing LB2’s TCP payload in transit (bytes in flight) along with the client’s receive window size.

The maximum payload in transit for the slow transaction is about 32KB; given that the client’s receive window is much larger, we know that the client is not limiting throughput.

Let’s compare this with the fast transaction as it ramps up exponentially through TCP slow start:

Time plot showing LB1’s payload in transit as it ramps up through slow start.

It becomes clear that LB1 does not limit send throughput – bytes in flight – to 32KB, instead allowing the transfer to make more efficient use of the available bandwidth. We can conclude that some characteristic of LB2 is artificially limiting throughput.
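A sender that never puts more than one window’s worth of data in flight is capped at window / RTT. A rough check of that ceiling for LB2, assuming the link’s ~130 msec round-trip delay cited below:

```python
# Throughput ceiling = bytes in flight / round-trip time.

in_flight_bytes = 32 * 1024  # LB2's observed in-flight ceiling (~32 KB)
rtt_seconds = 0.130          # the link's round-trip delay

ceiling_bps = in_flight_bytes * 8 / rtt_seconds
print(f"{ceiling_bps / 1e6:.1f} Mbps")  # ~2.0 Mbps
```

That ~2 Mbps ceiling is consistent with the ~1.5 Mbps download rate measured for the slow transaction; the gap between 2.0 and 1.5 is absorbed by slow start and ACK timing.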

Fixing the problems

For the MSB (most significant bottleneck), Cisco has identified a workaround (even if they might have slightly misstated the actual problem):

CSCud71628—HTTP performance across ACE is very bad. Packet captures show that ACE drops the TCP Window Size it advertises to the client to a very low value early in the connection and never recovers from this. Workaround: Disable the “tcp-options window-scale allow”.

For the NSB (next significant bottleneck), the LB configuration defaults to a TCP send buffer value of 32768 bytes. Modifying the parameter set tcp buffer-share from the default 32768 to 262143 (the maximum permitted value) allowed LB2’s throughput to match that of LB1.

Wait; do you see the contradiction here? If we disable TCP window scaling, that would limit the effective TCP buffer to 65535, limiting the download transfer rate to under 4Mbps (given the existing link’s 130ms round-trip delay).
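The same window / RTT arithmetic makes the contradiction explicit. With window scaling disabled, the advertised window can never exceed the unscaled 16-bit maximum:

```python
# Ceiling with window scaling disabled: the 16-bit field caps out at 65,535.

max_unscaled_window = 65535  # bytes; no scale factor can be applied
rtt_seconds = 0.130          # the link's round-trip delay

ceiling_mbps = max_unscaled_window * 8 / rtt_seconds / 1e6
print(f"{ceiling_mbps:.1f} Mbps")  # ~4.0 Mbps
```

So disabling the window-scale option would have traded one bottleneck for another, capping the download well below the ~7.5 Mbps the fast path achieves.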

But this was the spring of hope; it seems that changing the tcp buffer-share parameter also solved the window scaling problem, without having to disable that option. This suggests a less-than-obvious interaction between these parameters – but with happy users, we’ll take that bit of luck.

Is there more?

There are always additional NSBs; this is a tenet of performance tuning. We stop when the next bottleneck becomes insignificant (or when we have other problems to attend to). For this test transaction, the SOAP payload is rather large (1.5MB); while the payload is encrypted, it could still be compressed to reduce download time; a quick test using WinZip shows the potential for at least a 50% reduction.

While some of you will be quick to note that ACE has been discontinued, Cisco support for ACE will continue through January 2019.

The post A Tale of Two Load Balancers appeared first on Dynatrace blog – monitoring redefined.
