By Andreas Grabner
May 10, 2013 01:45 PM EDT
Adding more memory to your JVMs (Java Virtual Machines) might look like a temporary solution for memory leaks in Java applications, but it certainly won't fix the root cause of the issue. Instead of crashing once per day, the application may just crash every other day. "Preventive" restarts are just another desperate measure to minimize downtime, but let's be frank: this is not how production issues should be solved.
One of our customers - a large online retail store - ran into such an issue. They run one of their online gift card self-service interfaces on two JVMs. During peak holiday seasons when users are activating their gift cards or checking the balance, crashes due to OOM (Out Of Memory) were more frequent, which caused bad user experience. The first "measure" they took was to double the JVM Heap Size. This didn't solve the problem as JVMs were still crashing, so they followed the memory diagnostics approach for production as explained in Java Memory Leaks to identify and fix the root cause of the problem.
Before we walk through the individual steps, let's look at the memory graph that shows the problems they had in December during the peak of the holiday season. The problem persisted even after increasing the memory. They could fix the problem after identifying the real root cause and applying specific configuration changes to a third-party software component.
Only after identifying the actual root cause and applying the necessary configuration changes did the memory leak issue go away. Increasing memory was not even a temporary solution that worked.
Step 1: Identify a Java Memory Leak
The first step is to monitor the JVM/CLR Memory Metrics such as Heap Space. This will tell us whether there is a potential memory leak. In this case we see memory usage constantly growing, resulting in an eventual runtime crash when the memory limit is reached.
Java Heap Size of both JVMs showed significant growth starting Dec 2nd and Dec 4th resulting in a crash on Dec 6th for both JVMs when the 512MB Max Heap Size was exceeded.
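As a rough illustration of the heap metric being monitored here, the following sketch reads the current heap usage through the standard `java.lang.management` API. The class name `HeapMonitor` and the method `heapUtilization` are hypothetical names chosen for this example; the customer used an APM dashboard, not this code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of the customer's setup.
public class HeapMonitor {

    /** Returns used heap as a fraction of max heap, or -1.0 if the JVM reports no maximum. */
    static double heapUtilization() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the maximum is undefined
        return max > 0 ? (double) heap.getUsed() / max : -1.0;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d bytes, max=%d bytes, utilization=%.3f%n",
                heap.getUsed(), heap.getMax(), heapUtilization());
    }
}
```

Sampling this value periodically and charting it over several days is what makes a steady upward trend, as opposed to the normal sawtooth of garbage collection, visible.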
Step 2: Identify problematic Java Objects
The out-of-memory exception automatically triggers a full memory dump that allows for an analysis of which objects consumed the heap and are most likely to be the root cause of the out-of-memory crash. Looking at the objects that consumed most of the heap below indicates that they are related to a third-party logging API used by the application.
Sorting by GC (Garbage Collection) Size and focusing on custom classes (instead of system classes) shows that 80% of the heap is consumed by classes of a third-party logging framework.
A closer look at an instance of the VPReportEntry4 shows that it contains five strings, with one consuming 23KB (as compared to several bytes for the other string objects). This also explains the high GC Size of the String class in the overall Heap Dump.
Individual very large String objects as part of the ReportEntry object
Following the referrer chain further up reveals the complete picture. The EventQueue keeps LogEvents in an Array, which keeps VPReportEntrys in an Array. All of these objects seem to be kept in memory as the objects are being added to these arrays but never removed and therefore not garbage collected:
Following the referrer tree reveals that global EventQueue objects hold on to the LogEvent and VPReportEntry objects in arrays from which they are never removed
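The pattern described above can be sketched in a few lines. This is a minimal illustration, not the third-party framework's code: `QueueLeakSketch`, `Entry`, and `SketchQueue` are hypothetical stand-ins for the EventQueue and VPReportEntry classes seen in the heap dump, and the payload size is an assumption modeled on the large string mentioned earlier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; the real framework classes are not shown in the article.
public class QueueLeakSketch {

    /** Stand-in for a report entry that carries one large string. */
    static final class Entry {
        final String payload;
        Entry(String payload) { this.payload = payload; }
    }

    /** Stand-in for the event queue: add() appends, next() removes the oldest entry. */
    static final class SketchQueue {
        private final List<Entry> entries = new ArrayList<>();
        void add(Entry e) { entries.add(e); }
        Entry next() { return entries.isEmpty() ? null : entries.remove(0); }
        int size() { return entries.size(); }
    }

    // A static field is reachable from a GC root, so everything it references stays alive.
    static final SketchQueue GLOBAL_QUEUE = new SketchQueue();

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            // Each entry holds a string of 23 * 1024 characters (size chosen for illustration).
            GLOBAL_QUEUE.add(new Entry(new String(new char[23 * 1024])));
        }
        // next() is never called, so no entry becomes eligible for garbage collection.
        System.out.println("entries still referenced: " + GLOBAL_QUEUE.size());
    }
}
```

Nothing here is a leak in the language sense; the objects are simply still referenced. Only a call to `next()` drops the reference and lets the collector reclaim the entry.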
Step 3: Who allocates these objects?
Analyzing object allocation allows us to figure out which part of the code is creating these objects and adding them to the queue. Creating what is called a "Selective Memory Dump" when the application reached 75% Heap Utilization showed the customer that the ReportWriter.report method allocated these entries and that they have been "living" on the heap for quite a while.
It is the report method that allocates the VPReportEntry objects that stay on the heap for quite a while
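The "Selective Memory Dump" is a feature of the APM tool and is not reproduced here. The sketch below only illustrates the trigger condition, reacting when heap usage crosses 75%, using the usage-threshold support of the standard JMX memory beans. The class name `HeapThresholdSketch` and its methods are hypothetical, and which heap pools support a threshold depends on the JVM and garbage collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryNotificationInfo;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import javax.management.NotificationEmitter;

// Hypothetical example; shows only the 75% trigger, not the tool's memory dump.
public class HeapThresholdSketch {

    /** Converts a fraction of the maximum pool size into a byte threshold. */
    static long thresholdBytes(long maxBytes, double fraction) {
        return (long) (maxBytes * fraction);
    }

    /** Sets a usage threshold on every heap pool that supports one; returns how many were set. */
    static int armHeapPools(double fraction) {
        int armed = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) continue; // pool is no longer valid
            long max = usage.getMax();
            if (pool.getType() == MemoryType.HEAP && pool.isUsageThresholdSupported() && max > 0) {
                pool.setUsageThreshold(thresholdBytes(max, fraction));
                armed++;
            }
        }
        return armed;
    }

    public static void main(String[] args) {
        // The platform MemoryMXBean emits a notification when a pool crosses its threshold.
        NotificationEmitter emitter = (NotificationEmitter) ManagementFactory.getMemoryMXBean();
        emitter.addNotificationListener((notification, handback) -> {
            if (MemoryNotificationInfo.MEMORY_THRESHOLD_EXCEEDED.equals(notification.getType())) {
                System.out.println("threshold exceeded: " + notification.getMessage());
            }
        }, null, null);
        System.out.println("heap pools armed at 75%: " + armHeapPools(0.75));
    }
}
```

In a real setup the listener would start the diagnostic action, such as capturing allocation data, instead of printing a message.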
Step 4: Why are these objects not removed from the Heap?
The premise of the third-party logging framework is that log entries will be created by the application and written in batches at certain times by sending these log entries to a remote logging service using JMS. The memory behavior indicates that even though these log entries might be sent to the service, these objects are not always removed from the EventQueue leading to the out-of-memory exception.
Further analysis revealed that the background batch writer thread calls a logBatch method, which loops through the event queue (calling EventQueue.next) to send the log events currently in the queue. The question is whether as many messages are taken out of the queue (using next) as are put into it (using add), and whether the batch job is called frequently enough to keep up with the incoming event entries. The following chart shows the method executions of add as well as the calls to logBatch. It highlights that logBatch is not called frequently enough and therefore does not call next to remove messages from the queue:
The highlighted area shows that messages are put into the queue but not taken out because the background batch job is not executed. Once this leads to an OOM and the system restarts, it goes back to normal operation, but older log messages are lost.
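The comparison of add calls against next calls can be expressed as a simple backlog counter. The sketch below is an assumption-laden illustration: `QueueBacklogSketch` and `CountingQueue` are hypothetical, and `logBatch` here only drains the queue rather than sending anything over JMS.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical instrumentation; the article's measurements came from an APM tool.
public class QueueBacklogSketch {

    /** Queue wrapper that counts how many items went in and how many came out. */
    static final class CountingQueue<T> {
        private final Queue<T> delegate = new ArrayDeque<>();
        private final AtomicLong added = new AtomicLong();
        private final AtomicLong removed = new AtomicLong();

        synchronized void add(T item) {
            delegate.add(item);
            added.incrementAndGet();
        }

        synchronized T next() {
            T item = delegate.poll(); // null when the queue is empty
            if (item != null) removed.incrementAndGet();
            return item;
        }

        /** Items added but not yet removed; a steadily growing value means the consumer lags. */
        long backlog() {
            return added.get() - removed.get();
        }
    }

    /** Stand-in for the batch writer: drains the queue and returns how many items it took. */
    static <T> int logBatch(CountingQueue<T> queue) {
        int sent = 0;
        while (queue.next() != null) {
            sent++;
        }
        return sent;
    }

    public static void main(String[] args) {
        CountingQueue<String> queue = new CountingQueue<>();
        for (int i = 0; i < 5; i++) {
            queue.add("event-" + i);
        }
        System.out.println("backlog before batch: " + queue.backlog());
        System.out.println("sent in batch: " + logBatch(queue));
        System.out.println("backlog after batch: " + queue.backlog());
    }
}
```

Charting `backlog()` over time gives the same signal as the chart described above: if the value only grows, the batch job is not keeping up with incoming events.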
Step 5: Fixing the Java Memory Leak problem
After providing this information to the third-party provider and discussing with them the number of log entries and the system environment, the conclusion was that our customer used a special logging mode that was not supposed to be used in high-load production environments. It's like running with DEBUG log level in a high-load production environment. This overwhelmed the remote logging service, which is why the batch logging thread was stopped and log events remained in the EventQueue until the out-of-memory error occurred.
After making the recommended changes the system could again run with the previous heap memory size without experiencing any out-of-memory exceptions.
The Memory Leak issue has been solved and the application now runs even with the initial 512MB Heap Space without any problem.
They still use the same dashboards they have built to troubleshoot this issue, and to monitor for any future excessive logging problems.
These dashboards allow them to verify that the logging framework can keep up with log messages after they applied the changes.
Adding memory to crashing JVMs is most often not even a temporary fix. If you have a real Java memory leak, it will just take longer until the Java runtime crashes. It will even incur more overhead due to garbage collection when using larger heaps. The real answer is to use the simple approach explained here. Look at the memory metrics to identify whether you have a leak or not. Then identify which objects are causing the issue and why they are not collected by the GC. Working with engineers or third-party providers (as in this case) will help you find a permanent solution that allows you to run the system without impacting end users and without additional resource requirements.
If you want to learn more about Java Memory Management or general Application Performance Best Practices check out our free online Java Enterprise Performance Book. Existing customers of our APM Solution may also want to check out additional best practices on our APM Community.