Hunting and Fixing AWS DynamoDB Client Scalability Issues on Tomcat

As a performance consultant, I get called on to address various performance issues. One of our recent scalability issues occurred in a microservice exposing a REST API. The microservice runs on Apache Tomcat, on an AWS EC2 instance, in a VPC. It uses ehcache as its in-memory cache and DynamoDB as its persistent data source. DynamoDB receives updates from the source system through a data pipeline built on Kinesis and Lambda functions.

In this blog, I’ll walk through the steps taken by our performance engineer Melchor to analyze this scalability issue in our Performance Test environment, which tools were used, and how the problem was resolved. I hope you find this useful!

API Scalability Analysis in Performance Test Environment

The Price API has a very tight SLA: 99% of the requests processed within any given minute must show a response time below 20ms under a load of 2000 tps (transactions per second). To keep the number of EC2 instances and the number of DynamoDB calls low, we opted for memory-optimized EC2 instances and increased the JVM heap size to 100 GB to cache 80-90% of SKU price details in the JVM. In performance tests, we noticed that we could only meet this SLA if all requests were served from the cache. During cache warmup, or when more than 10% of items were not found in the cache (cache misses), the service would miss its SLA.
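To make the SLA arithmetic concrete, here is a minimal sketch of a p99 check over one window of response times. It assumes the nearest-rank percentile method, which the post does not specify, and the class and method names are made up for illustration.

```java
import java.util.Arrays;

/** Sketch: check one window of response times against a "p99 below 20 ms" target. */
final class SlaCheck {
    static final double SLA_MILLIS = 20.0; // limit taken from the SLA described above
    static final int SLA_PERCENT = 99;     // share of requests that must be below the limit

    /** Nearest-rank percentile: the smallest sample with at least percent% of samples at or below it. */
    static double percentile(double[] samplesMillis, int percent) {
        if (samplesMillis.length == 0 || percent < 1 || percent > 100) {
            throw new IllegalArgumentException("need samples and a percent in 1..100");
        }
        double[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        // 1-based rank = ceil(percent / 100 * n), computed in integer arithmetic
        int rank = (int) (((long) percent * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }

    /** True when the 99th percentile of the window is strictly below the 20 ms limit. */
    static boolean meetsSla(double[] samplesMillis) {
        return percentile(samplesMillis, SLA_PERCENT) < SLA_MILLIS;
    }
}
```

At 2000 tps a one-minute window holds 2000 × 60 = 120,000 requests, so at most 1,200 of them (1%) may take 20ms or longer before the window fails the check.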

The following diagram visualizes our service flow. Consumer APIs call the Price API to look up prices for multiple items and different location IDs. The Price API checks whether the requested data is in the ehcache. If not, it pulls the data from DynamoDB using the AWS DynamoDB client library.

Service Flow when a consumer issues requests to the Price API Micro Service on AWS
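The lookup flow described above is a cache-aside pattern. The following is a rough sketch of it, not the Price API's actual code: a ConcurrentHashMap stands in for ehcache, a plain function stands in for the DynamoDB read, and the "sku|locationId" key format is hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Sketch of the lookup flow: in-memory cache first, backing store only on a miss. */
final class PriceLookup {
    // Stand-in for ehcache (assumption: the real service uses ehcache's own API).
    private final Map<String, Double> cache = new ConcurrentHashMap<>();
    // Stand-in for the DynamoDB read; returns null when the item does not exist.
    private final Function<String, Double> store;
    private final AtomicLong hits = new AtomicLong();
    private final AtomicLong misses = new AtomicLong();

    PriceLookup(Function<String, Double> store) {
        this.store = store;
    }

    /** Returns the cached price, or loads it from the store and caches it on a miss. */
    Double priceFor(String sku, String locationId) {
        String key = sku + "|" + locationId; // hypothetical key format
        Double cached = cache.get(key);
        if (cached != null) {
            hits.incrementAndGet();
            return cached;
        }
        misses.incrementAndGet();
        Double loaded = store.apply(key);
        if (loaded != null) {
            cache.put(key, loaded);
        }
        return loaded;
    }

    /** Share of lookups that had to go to the backing store. */
    double missRatio() {
        long total = hits.get() + misses.get();
        return total == 0 ? 0.0 : (double) misses.get() / total;
    }
}
```

Tracking the miss ratio alongside the lookups makes it easy to see when the service crosses the 10% cache-miss mark mentioned earlier.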

To monitor individual service health, we log entry and exit calls of each service invocation in Splunk. This lets us see how much time is spent in the Price API and in DynamoDB calls. We also look at AWS CloudWatch metrics to validate the response time from DynamoDB. We ensure that DynamoDB query time stays between 3 and 4ms and that DynamoDB has enough read/write capacity.

Application Monitoring with Dynatrace AppMon

The Price API team also leverages Dynatrace AppMon for end-to-end transactional tracing and deep-dive diagnostics. It is the team’s tool of choice because Dynatrace AppMon is already used for live performance monitoring of all our services in production. Melchor used Dynatrace AppMon to analyze a spike in response time above the accepted SLA, as neither CloudWatch nor the custom-built logging via Splunk provided an answer to the Price API team.

Next, let’s walk through the steps in Dynatrace AppMon to see how we identified high response time and its root cause. In case you want to try it on your own I suggest you:

Issue #1: Time Spent in RandomUUID method of servlet API

Once Dynatrace AppMon collects data you can decide whether to analyze in the Dynatrace AppMon Diagnostics Client or go directly to the Dynatrace AppMon Web interface. In the last blog on Hybrid Cloud Patterns we showed how we analyzed our PurePaths in the Web Interface.

In today’s example, we stick with the Dynatrace AppMon Diagnostics Client because we will perform thread dump analysis, which is better done in that user interface.

Step 1: Analyze PurePaths to understand hotspots

Dynatrace AppMon captures every single PurePath of every single request that was executed. In our scenario we rely on this 100% transactional coverage because most of the transactions we consider slow (>20ms) are considered fast by other tools in the APM space. Other tools would therefore not capture all the details we need to optimize our critical transactions.

In Dynatrace AppMon we typically start by opening and looking at a PurePath. In the PurePath Tree there is a neat option that is called “show all nodes”. Now we not only see critical methods based on Dynatrace’s hotspot algorithm but we get to see every method executed including its execution time contribution and whether that time was spent on CPU, Sync, Garbage Collection or I/O. The following screenshot shows that extended PurePath Tree and it is easy to spot that the method taking most of the time was the nextBytes method. This method already spent 53.33ms getting a randomUUID in our servlet execution, without even reaching the business API code. Remember – our API SLA is 20ms – so we are already more than twice over the limit. We can also observe that nextBytes spends 95% of its time waiting to enter a synchronized code block instead of actually executing code!

The PurePath Tree shows the complete transaction flow, the executed methods and how long they took to execute. It is easy to spot the problematic 53ms execution time in the SecureRandom class, 95% of which is synchronization time.

Step 2: Thread Diagnostics to understand dependencies

At this point, we decided to take thread dumps to determine why the nextBytes method in the SecureRandom class was spending that much time in sync.

Fortunately, Dynatrace AppMon comes with a built-in thread dump analysis feature. Thread dumps can be triggered on demand, scheduled, or triggered by an event. After we executed a thread dump we could immediately see what all threads were doing, and whether they were blocked by other threads.

Dynatrace AppMon comes with a built-in thread diagnostics feature to analyze what threads are doing and how they are cross impacting each other.
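Outside of AppMon, a similar first look at blocked threads is possible with the JDK's own ThreadMXBean. The sketch below only approximates what a thread dump viewer shows, and the class name BlockedThreads is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Sketch: list the threads that are currently BLOCKED waiting for a monitor. */
final class BlockedThreads {
    /** Returns one line per blocked thread: who waits, for which lock, held by whom. */
    static List<String> snapshot() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> blocked = new ArrayList<>();
        // false, false: skip locked-monitor and synchronizer details to keep the dump cheap
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked.add(info.getThreadName()
                        + " waits for " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return blocked;
    }
}
```

Taking a few snapshots in a row and counting how often the same lock owner appears gives a crude picture of which monitor the other threads are queuing on.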

It turned out that many Tomcat http-nio and threadPoolTaskExecutor (used for calling DynamoDB asynchronously) threads were blocked by a single thread executing nextBytes, which is a thread-safe, synchronized method. All incoming traffic passes through this bottleneck, since establishing a secure SSL connection uses nextBytes to obtain a secure random number, which blocks the Tomcat threads.

In addition, all async threads that call DynamoDB (threadPoolTaskExecutor) end up blocked, since the AWS DynamoDB client library requires a randomUUID and uses the same secure random implementation configured in the java.security file used by Tomcat's JVM.

The Price API also accepts multiple SKUs in one HTTP request, but queries DynamoDB for each SKU with a separate get request (this sounds like the classic N+1 query pattern that Andreas Grabner has been talking about). During the service's warm-up phase, or when we see more than 10% cache misses, the number of nextBytes invocations made by both Tomcat’s threads and the async threadPoolTaskExecutor threads grows rapidly. Since nextBytes is a synchronized method, we see a huge increase in wait time for all other concurrent invocations of nextBytes. Similar to the PurePath tree, we can also analyze the full call stack for each thread in the dump, which shows us who is really calling into these synchronized methods.

When analyzing Thread Dumps we also get to see the full stack trace for every thread. This helps us understand who calls nextBytes.
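To get a feel for this kind of contention without a full load test, a small probe can have many threads call UUID.randomUUID(), which draws its bytes from a shared SecureRandom. This is only a sketch: the class name is hypothetical, and how much the threads actually contend depends on the JDK version, the SecureRandom provider and the configured entropy source.

```java
import java.util.UUID;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch: time how long N threads need to create random UUIDs concurrently. */
final class UuidContentionProbe {
    /** Starts the workers together, counts generated UUIDs, and returns elapsed milliseconds. */
    static long run(int threads, int perThread, AtomicLong generated) {
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(() -> {
                try {
                    start.await(); // release all workers at the same moment
                    for (int n = 0; n < perThread; n++) {
                        UUID.randomUUID(); // backed by a shared SecureRandom
                        generated.incrementAndGet();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }, "uuid-probe-" + i);
            worker.start();
        }
        long begin = System.nanoTime();
        start.countDown();
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - begin) / 1_000_000;
    }
}
```

Comparing a single-threaded run against a run with as many threads as the Tomcat pool, for the same total number of UUIDs, gives a rough indication of whether the secure random source serializes callers on a given machine.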

Dynatrace also provides a useful feature of “decompiling source code”. Right from the PurePaths, or from the Thread Dumps, we can get the decompiled version of every single method on the call stack. The following shows us the synchronized nextBytes method:

Dynatrace provides a neat feature called “decompile source code”. Makes it easier to understand what happens within methods we do not have source code access to.

Solution to Issue #1: Time Spent in RandomUUID method of servlet API

We did some digging in the source code of JDK 1.6. It turns out that SecureRandom seeds itself from /dev/random or /dev/urandom. We used strace to identify which source was used in our case: it was /dev/random. If you want to learn more about this, see the following two links: https://linux.die.net/man/4/random, http://man7.org/linux/man-pages/man4/random.4.html

How we Fixed it

In the $JAVA_HOME/jre/lib/security/java.security configuration file we changed the securerandom.source entry from /dev/random to /dev/./urandom, which is much faster and does not block threads as readily as /dev/random does.


This can also be achieved by adding the following parameter in the JVM command line
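For reference, the standard JDK names for these two settings are the securerandom.source entry in java.security and the java.security.egd system property, so the command-line form is conventionally written as -Djava.security.egd=file:/dev/./urandom. These names come from the JDK rather than from this post. The sketch below, with a made-up class name, prints what a running JVM actually picked up, which is a quick way to confirm the change took effect.

```java
import java.security.SecureRandom;
import java.security.Security;

/** Sketch: report which secure random source settings the current JVM sees. */
final class EntropySourceCheck {
    static String describe() {
        // Entry from the java.security configuration file (null if the entry is absent)
        String fromFile = Security.getProperty("securerandom.source");
        // Value passed as -Djava.security.egd=... on the command line (null if not set)
        String fromFlag = System.getProperty("java.security.egd");
        // Algorithm chosen for a default SecureRandom instance on this platform
        String algorithm = new SecureRandom().getAlgorithm();
        return "securerandom.source=" + fromFile
                + ", java.security.egd=" + fromFlag
                + ", default algorithm=" + algorithm;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

The /dev/./urandom spelling is commonly used instead of /dev/urandom because some older JDK releases are reported to treat the literal file:/dev/urandom value specially. Whether that applies depends on the JDK version in use.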


This change allowed our API to operate within the defined 20ms SLA because we completely eliminated the synchronization overhead!

Issue #2: AWS DynamoDB client metadata cache

After fixing the RandomUUID bottleneck we soon started to see blocked threads again. This time for a different reason. The approach to identify it was similar though.

Step 1: Thread Diagnostics

We went back to creating Thread Dumps using Dynatrace AppMon, which quickly showed us why threads were getting blocked. This time it was due to the add method of the ResponseMetaDataCache class in the AmazonDynamoDB client library.

The high-level thread dump analysis showed us that more than 50% of our threads were in a blocked state.
Looking at the stack trace showed us that the waiting calls originated in the AmazonHttpClient library.

Solution to Issue #2: Time Spent in AWS DynamoDB client

The default behavior of the Amazon AWS HTTP client libraries is to cache response metadata for troubleshooting. For more details, check out setCacheResponseMetadata in the AWS documentation.

We set this option to false to remove the bottleneck on calls made to DynamoDB through the Amazon client library.

Our code change to disable the default cache behavior of the Amazon DynamoDB Client Library.
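In the AWS SDK for Java 1.x, setCacheResponseMetadata is a setter on the ClientConfiguration object that is passed in when the DynamoDB client is created. The block below is deliberately not SDK code. It is a simplified stand-in model, with hypothetical class names and an arbitrary cache size, that shows why a synchronized add on a shared metadata cache sits on every request path and how a boolean switch takes it out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified stand-in, NOT the AWS SDK: a client with an optional response metadata cache. */
final class ModelClient {
    /** Hypothetical bounded cache whose add() is synchronized, so concurrent callers queue up. */
    static final class MetadataCache {
        private final Map<Integer, String> entries;
        private int addCalls;

        MetadataCache(final int maxEntries) {
            this.entries = new LinkedHashMap<Integer, String>() {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                    return size() > maxEntries; // evict the oldest entry once the cache is full
                }
            };
        }

        synchronized void add(Object request, String requestId) {
            addCalls++;
            entries.put(System.identityHashCode(request), requestId);
        }

        synchronized int addCalls() {
            return addCalls;
        }
    }

    private final boolean cacheResponseMetadata;
    private final MetadataCache cache = new MetadataCache(50); // arbitrary size for the model

    ModelClient(boolean cacheResponseMetadata) {
        this.cacheResponseMetadata = cacheResponseMetadata;
    }

    /** Pretends to execute a request; only touches the shared cache when the switch is on. */
    String execute(Object request) {
        String requestId = "req-" + Integer.toHexString(System.identityHashCode(request));
        if (cacheResponseMetadata) {
            cache.add(request, requestId); // every request thread passes through this monitor
        }
        return requestId;
    }

    int cacheAdds() {
        return cache.addCalls();
    }
}
```

With the switch off, request threads never enter the synchronized add method. The trade-off is that response metadata is no longer available for later troubleshooting.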

Performance after both fixes

After implementing the two fixes described above, the Price API could handle peak load within its SLA. Thread dumps taken during the test no longer showed any blocked threads.

No more blocking threads after applying both fixes

And the PurePaths also looked much better!

Transaction Response Time was now within our SLAs as validated through the PurePaths

Thanks again to Melchor for sharing this story. It shows us that good performance engineers not only understand how to analyze performance issues, but also work with the underlying frameworks and the engineering team to come up with the right solution. It also showed us that even though we had built custom log-based monitoring, we could only find and fix these issues thanks to Dynatrace.

If you want to learn more about how Dynatrace can help, feel free to get your own Dynatrace AppMon Personal License or try our Dynatrace SaaS offering with Full Stack Cloud and Container monitoring support.

The post Hunting and Fixing AWS DynamoDB Client Scalability Issues on Tomcat appeared first on Dynatrace blog – monitoring redefined.
