Hybrid Cloud Problem Patterns: Chasing DNS Lookup Times from AWS EC2

As a performance architect, I get called into various production performance issues. One of our recent issues occurred on a Tomcat application server running on an AWS EC2 instance in a VPC that is joined to an on-premise DNS server. The service running there calls another microservice. When the service went live, we noticed a high response time for calls to the downstream microservice, yet the downstream service's logs did not show any performance issue.

In this blog, I’ll walk through the steps our tech architect Neeraj Verma took to analyze this issue in our production environment, the tools we used, some background on DNS lookups, and how the problem was resolved. I hope you find this useful!

API High Response Time Analysis in Production Environment

Performance engineering is the science of discovering problem areas in applications under varying but realistic load conditions. It is not always easy to simulate real traffic and find all problems before going live. Therefore, it is advisable to work out how to analyze performance problems not only in test, but also in a real production environment. Having the right tools installed in production allows us to analyze issues and find root causes that are hard to simulate in testing.

The following diagram visualizes our service layout. Our eCommerce APIs call the ShoppingHistory API using the customerid. The ShoppingHistory API calls DynamoDB and the Customer API to serve requests from eCommerce.

Architectural Overview: Transactional Flow when the eCommerce Frontend calls the OrderAPI, and how it makes its way through different service layers deployed on AWS.

In order to monitor individual service health, we log the entry and exit of each service invocation in a custom telemetry system. Each team uses Kibana/Grafana dashboards to measure health. Through the ShoppingHistory API dashboard, the team could see that the time was being taken by Customer Service, even though the Customer Service dashboard did not show any issue at all. This is when the classical blame game would start. In our case we tasked the ShoppingHistory API team with finding the actual root cause. And here is what we did.
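To make the gap between the two dashboards concrete, here is a minimal Java sketch of entry/exit timing. It is not our telemetry code: the class name CallTimingSketch, the helper names, and the pause durations are all made up for illustration. The only point is that a timer wrapped around the caller's side of a call includes everything that happens before the request reaches the downstream service, while a timer inside the downstream service does not.

```java
import java.util.function.Supplier;

class CallTimingSketch {

    /** A call's result together with the elapsed time measured around it. */
    static final class Timed<T> {
        final T value;
        final long elapsedMillis;

        Timed(T value, long elapsedMillis) {
            this.value = value;
            this.elapsedMillis = elapsedMillis;
        }
    }

    /** Records entry and exit timestamps around the given call. */
    static <T> Timed<T> time(Supplier<T> call) {
        long start = System.nanoTime();
        T value = call.get();
        return new Timed<>(value, (System.nanoTime() - start) / 1_000_000);
    }

    /** Sleeps for the given time; used only to simulate work in this sketch. */
    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** What the downstream service measures: only its own work (simulated). */
    static Timed<String> calleeSide() {
        return time(() -> {
            pause(10); // hypothetical server-side work
            return "history";
        });
    }

    /** What the calling service measures: client-side setup plus the downstream call. */
    static Timed<Timed<String>> callerSide() {
        return time(() -> {
            pause(50); // hypothetical client-side work before the request is sent
            return calleeSide();
        });
    }

    public static void main(String[] args) {
        Timed<Timed<String>> caller = callerSide();
        System.out.println("caller-side elapsed: " + caller.elapsedMillis + " ms");
        System.out.println("callee-side elapsed: " + caller.value.elapsedMillis + " ms");
    }
}
```

In a setup like this the caller-side number is always at least as large as the callee-side number, which is why a slow caller-side reading alone does not prove the downstream service is slow.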

Application Monitoring with Dynatrace AppMon

Our tool of choice was Dynatrace AppMon, which we already used for live performance monitoring of all our services in production. Let me walk you through the steps in Dynatrace AppMon that led us to the high response time and its root cause. In case you want to try it on your own, I suggest you do the following:

  1. Get your own Dynatrace Personal License
  2. Watch the Dynatrace YouTube Tutorials
  3. Read up on Java Memory Management

Step #1: Basic Transaction

Once Dynatrace AppMon collects data you can decide whether to analyze it in the Dynatrace AppMon Diagnostics Client or go directly to the Dynatrace AppMon Web interface. With the recent improvements in Dynatrace AppMon 2017 May (v7) the Web Interface is even more convenient when analyzing PurePaths, which is why we go there. In the Web Interface we often start by looking at the Transaction Flow of our System. The Transaction Flow is dynamically generated by Dynatrace thanks to its capability to trace every single transaction, end-to-end, enabled through their PurePath technology.

Looking at the Transaction Flow, we could immediately see that most of the time (91%) was actually spent in the ShoppingHistory JVM rather than in Customer Service, which until that point we had assumed to be the problem, as our logging indicated. Fortunately, Dynatrace AppMon told us otherwise!

The Dynatrace AppMon PurePath highlighted our Shopping History JVM as the response time hotspot.

Step #2: Drill Down into PurePath (show all nodes)

The detailed PurePath shows where most of the time is spent, down to the method itself. In our case we could spot that resolving the address of the backend microservice took about 2s. In the screenshot below you can see that when the frontend service tries to call the backend service it must first open a connection (HttpClient.connect method) which itself has to resolve the passed endpoint address. This method then calls the internal Java classes to do the actual DNS name resolution.

The PurePath Tree shows the complete transaction flow, the executed methods and how long they took to execute. It is easy to spot the 2s execution time of the internal Java DNS name resolution method.
The high-level performance overview that you get for each PurePath also gives a good indication of which component the hotspot resides in, and it points to the same problem: the amount of time spent making web request calls.
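If you do not have a tracing tool at hand, the same symptom can be checked from inside a JVM with a few lines of Java. The following is a minimal sketch, not what Dynatrace AppMon does internally: it simply times a name lookup through java.net.InetAddress. The class name DnsLookupTimer is made up, and "localhost" is only a placeholder for the real downstream host name.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.TimeUnit;

class DnsLookupTimer {

    /** Resolves the host name and returns the elapsed wall-clock time in milliseconds. */
    static long timeLookupMillis(String host) {
        long start = System.nanoTime();
        try {
            InetAddress.getAllByName(host); // JDK name resolution
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Placeholder default; pass the downstream service's host name as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        // The first call goes to the resolver; a repeat call shortly afterwards is
        // typically answered from the JVM's own address cache.
        System.out.println("first lookup:  " + timeLookupMillis(host) + " ms");
        System.out.println("second lookup: " + timeLookupMillis(host) + " ms");
    }
}
```

Run it on the same instance and with the same JVM options as the affected service. A first lookup that takes seconds rather than milliseconds would match what the PurePath showed.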

Solution

Based on the information collected from our production environment, we searched the internet for a solution and found an explanation of a similar issue on the IBM blog. This is where I found the answer to our issue:

The problem could be lookup issues between IPv6 versus IPv4. If the Domain Name System (DNS) server is not configured to handle IPv6 queries, the application may have to wait for the IPv6 query to time out. By default, java.net.InetAddress resolutions requiring the network will be carried out over IPv6 if the operating system supports both IPv4 and IPv6. However, if the name service does not support IPv6, then a performance issue may be observed, as the first IPv6 query has to time out before a successful IPv4 query can be made.

To reverse the default behavior and use IPv4 over IPv6, add the following Generic JVM argument:

  -Djava.net.preferIPv4Stack=true

We added -Djava.net.preferIPv4Stack=true to the JVM parameters and restarted the JVM.
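After a restart it is worth confirming that the argument actually reached the JVM. The sketch below is one way to do that, under the assumption that you can run a small class with the same startup options as the service. The class name Ipv4PreferenceCheck is made up and "localhost" is again only a placeholder host. It reads the system property and prints the address family of every address the JVM resolves for the host.

```java
import java.net.Inet4Address;
import java.net.Inet6Address;
import java.net.InetAddress;

class Ipv4PreferenceCheck {

    /** True when the JVM was started with -Djava.net.preferIPv4Stack=true. */
    static boolean prefersIpv4Stack() {
        return Boolean.getBoolean("java.net.preferIPv4Stack");
    }

    /** Labels an address by family so the effect of the flag is visible in the output. */
    static String family(InetAddress address) {
        if (address instanceof Inet4Address) {
            return "IPv4";
        }
        if (address instanceof Inet6Address) {
            return "IPv6";
        }
        return "unknown";
    }

    public static void main(String[] args) throws Exception {
        // Placeholder default; pass the downstream service's host name as the first argument.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println("java.net.preferIPv4Stack = " + prefersIpv4Stack());
        for (InetAddress address : InetAddress.getAllByName(host)) {
            System.out.println(family(address) + "  " + address.getHostAddress());
        }
    }
}
```

Start it once without the argument and once with -Djava.net.preferIPv4Stack=true and compare the output. With the flag set, the property should print as true and only IPv4 addresses should be listed.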

Now the transaction executed much faster, as shown in Dynatrace:

PurePath overview shows us a total execution time of 128ms. Most of the time now is spent in other areas – such as database calls – but no longer in resolving DNS addresses.
The PurePath Top Contributors tab makes this even clearer. HTTP calls now finish in milliseconds.
The API Breakdown also clearly shows that we solved this one problem. Now we can focus on the other hotspots if we want to improve performance further.

This story showed us how important it is to monitor all your applications and services in every environment in which they run. We will see a larger push towards hybrid cloud, which means we have to find a way to detect problems like this one. Dynatrace natively supports all these technologies and, thanks to its analytics capabilities, makes it easy to find and fix them.

The post Hybrid Cloud Problem Patterns: Chasing DNS Lookup Times from AWS EC2 appeared first on Dynatrace blog – monitoring redefined.
