/ July 2013 ~ Java EE Support Patterns

7.26.2013

Plumbr 3.0 – An evolutive approach for Java memory leaks

Most Java developers are familiar with the concept of “memory leaks”. In the Java world, it essentially means the constant expansion of a particular memory space such as the Java heap, PermGen & Metaspace (Java 8) or the native memory. Such leaks are often the result of Java code problems from your applications, third part API’s or even inside the Java EE container and JDK code base. The level of complexity to pinpoint and resolve these types of problems can vary from medium to very high.

The good news is that the JVM has evolves into a great piece of engineering. The GC algorithms, troubleshooting and runtime data exposure capabilities of the JVM are now quite mature. We have access to most critical runtime data via the JMX API. All JVM vendors also provide JVM Thread Dump & Heap Dump generation techniques. Finally, we also have access to a great arsenal of tools and Java profilers to monitor the JVM internals and analyze memory dumps such as JVisualVM, Eclipse MAT or other commercial products.

That being said, even with the help of all these tools, the analysis process of Java memory leaks remains quite challenging. The skillset requirement for individuals performing such analysis is also high since it requires proper knowledge of the Java heap, troubleshooting and runtime analysis techniques.

I recently had the chance to experiment with the latest version of a Java memory leak analyzer tool that I’m sure you heard about: Plumbr. Plumbr’s approach can be resumed as this: instead of analyzing the memory dumps “after the fact” or following an OutOfMemoryError condition, why not writing a program that keeps tract of the Java objects and detect precisely memory leak suspects at runtime…and with live production traffic?

This article will share the positive experience I had with this product while experimenting and “engineering” a Java heap memory leak with the latest version of WildFly (formerly known as JBoss AS). I will also share my personal tips on how to improve your troubleshooting experience with Plumbr by combining additional tools or “synergists”. A complete Plumbr tutorial video will also be available from my YouTube channel in the near future.

Plumbr has also the capabilities to identify class loader related memory leaks affecting the PermGen space of the HotSpot JVM. I may publish more articles in the future addressing this particular type of leak with the help of Plumbr.

Memory leak simulator and environment specifications

In order to give Plumbr some challenges, the following memory leak was engineered:

  • A JAX-RS (REST) Web Service was created and exposed via the jvmleak URI as per below attributes.
    @GET
    @Path("/jvmleak")
    @Produces(MediaType.APPLICATION_JSON)
    public Integer jvmLeak() {}

Each invocation of jvmleak is performing the following logic:

1.     Allocate a high amount of short-lived objects (no reference).
2.     Allocate a small amount of long-lived objects (normal or hard references) to a static ConcurrentHashMap data structure.
3.     Returns the current count of the “leaking” ConcurrentHashMap.

We also created 3 extra Java classes:

  • JVMMemoryAllocator. This class is responsible to perform the short-lived and long-lived memory allocations.
  • ShortLivedObj. This class represents a short-lived object with no reference.
  • LongLivedObj. This class represents a long-lived object with hard references.
You will find below the environment specifications and software’s used to create and execute this Java heap memory leak simulator.

Plumbr download and installation

Plumbr is packaged as a “trialware”. This means that you can download the full version for free and determine if your application(s) contain memory leak(s). The location of the memory leak (code level) will require your or your client to purchase a Plumbr license plan as per your needs.

I recommend that you first install Plumbr to a development environment prior to production. Plumbr is a lightweight agent but extra due diligence is always recommended in order to reduce risk for your production environment.

  1. Login / register at https://portal.plumbr.eu/. This will allow you to download the Plumbr client.

  1. Unzip the archive at a location of your choice (local environment, development server etc.).
  2. If your Java application is running on Windows OS, you can simply double click the plumbr.jar located under the Plumbr root directory or use the following command: java -jar plumbr.jar. This will launch the Plumbr client and allow you to “attach” it to a running JVM process. Another approach is to simply edit your application JVM start-up settings and add the following argument: -javaagent:<PLUMBR ROOT DIRECTORY>plumbr.jar. I personally prefer the second approach.
Now back to our memory leak simulator, you will notice from the snapshot below that we added the Plumbr agent reference to the WildFly JVM start-up arguments inside the IDE directly.


After the restart of your JVM process, you will notice a similar message at the beginning of the standard output log:

************************************************************
* Plumbr (B1422) is attached.                              *
* Running with JRE from …\jdk1.7.0_09\jre                  *
* Your license is valid                                    *
*   for 20 more days                                       *
*   for Pierre-Hugues Charbonneau.                         *
* Plumbr agent is connected to the Plumbr Portal.          *
* Open up https://portal.plumbr.eu to follow its progress. *
************************************************************

This is telling us that Plumbr is now active and connected to the Plumbr Portal or reporting server. Now go to Plumbr portal and verify if it is properly receiving data from your JVM process.




Memory leak simulator warm-up

Now that the Plumbr agent is connected, it is now time to fire our memory leak simulator. In order to simulate an interesting memory leak, we will use the following load testing specifications:

  • JMeter will be configured to execute our REST Web Service URI jvmleak.
  • Our JMeter thread group will be configured to run “forever” with 20 concurrent threads.
  • JvisualVM and the Plumbr Portal will both be used to monitor the Java heap utilization along with the internal Plumbr MBeans.

Please note that launching JVisualVM is optional but I definitely recommend that you use it in conjunction with the MBeans browser plugin (add-on for JVisualVM). This will allow you to track the status and progress of the Plumbr agent directly from JVisualVM. It is important to note you will need to generate load and get several Full GC iterations before Plumbr is able to identify true memory leaks. The elapsed time will also depend of the activity (load) and nature of your application.



Since we have not fire JMeter at this point, you can see that Plumbr is still in its WARMUP phase with 10% completion.

Memory leak simulator execution

Generating an adequate amount of load to your application will allow Plumbr to complete its WARMUP phase fairly quickly. We can now see from our simulator that Plumbr is in RUNNING state and tracking several thousands of objects already. However, we had no Full GC at this point so we will need to wait until a few iterations are performed. This will allow the internal Plumbr computing engine to do its assessment and potentially narrow down the location of a memory leak.




You can notice that number of objects tracked by Plumbr will vary in between Full GC iterations. This is because Plumbr will typically only focus on objects that are “suspect” such as objects able to survive major collections (Full GC).

After a few major collections, Plumbr was able to detect a potential memory leak with 98% confidence. We are almost there…It is important to note that no OutOfMemoryError was thrown. The leak can also be observed from JVisualVM but it is not that obvious such early in the load test life cycle.




Finally, after more major collections, Plumbr was able to detect a memory leak with a confidence level of 100%. As you can see from the snapshots, JVisualVM did allow us to easily monitor the Java heap and Plumbr MBeans status progressions.




You will also notice the following message in the standard output logs.
15:22:41,270 INFO  [stdout] (Plumbr thread - 176) ******************************************************************************
15:22:41,271 INFO  [stdout] (Plumbr thread - 176) * Plumbr has found a memory leak.                                             *
15:22:41,271 INFO  [stdout] (Plumbr thread - 176) * Plumbr will now stop tracking to leave more resources for your application. *
15:22:41,271 INFO  [stdout] (Plumbr thread - 176) * You can find the detailed memory leak report by opening this url:           *
15:22:41,272 INFO  [stdout] (Plumbr thread - 176) * https://portal.plumbr.eu/report/5515                                        *
15:22:41,272 INFO  [stdout] (Plumbr thread - 176) ******************************************************************************

At this point we have to wait for Plumbr to analyze the location and type of leak found. Once the analysis process is completed, a report will be generated and available from the Plumbr Portal.








Plumbr memory leak report

Plumbr 3.0 provides centralized memory leak report capabilities. Each time Plumbr is able to detect a new memory leak, a report is created and uploaded to the reporting Portal. We can see below that a new report was generated from our fresh memory leak simulator execution.



The report is the ultimate outcome and deliverable of this exercise: a precise location, at code level, of where the memory leak is located.




We can see from the above report that Plumbr was perfectly able to identify our memory leak. You can notice that the leak report is split into 4 main sections:

  • The header contains the number of leaks found along with the detail on the memory footprint occupied by the leaking objects vs. the total Java heap capacity.
  • Leaking object type: This represents the object type of the instances accumulating over time in between major collections.
  • Leak creation location: This represents the caller and Java class where the leaking objects are created.
  • Memory references: This represents the object reference tree where the leaking objects are still referenced or held.

In our case, Plumbr was able to identify the exact location of our engineered memory leak.

  • LongLivedObj is indeed the expected leaking object type.
  • JVMMemoryAllocator is definitely the Java class where the leak is created.
  • ConcurrentHashMap is the implemented "referencer" or container.
Interestingly, Plumbr was also able to identify another potential leak inside WildFly 8 Alpha 3 itself…

Now please keep in mind that following the creation of the report, Plumbr will not magically resolve the memory leak for you. You will need to spend some time and review the affected code, especially the location where the leak is created and understand why the objects are still referenced. Plumbr will do 50%+ of the work. You will need to take care of the other half and determine the proper code fix required, upgrade or patch of the offending API(s).

Final words

I hope that you appreciated this tutorial and review of Plumbr 3.0 while simulating a true Java heap memory leak. This product is looking quite promising and I definitely recommend that you give Plumbr a try in your environment; especially if you suspect that you are dealing with memory leak problems affecting your production environment stability. Thumbs up to the Plumbr core team!

I’m looking forward for your comments and please share your experience with Plumbr.

7.09.2013

Java Just-In-Time compilation: more than just a buzzword

A recent Java production performance problem forced me to revisit and truly appreciate the Java VM Just-In-Time (JIT) compiler. Most Java developers and support individuals have heard of this JVM run time performance optimization but how many truly understand and appreciate its benefits?

This article will share with you a troubleshooting exercise I was involved with following the addition of a new virtual server (capacity improvement and horizontal scaling project).

For a more in-depth coverage of JIT, I recommend the following articles:

## Just-in-time compilation

## The Java HotSpot Performance Engine Architecture

## Understanding Just-In-Time Compilation and Optimization

## How the JIT compiler optimizes code

JIT compilation overview

The JIT compilation is essentially a process that improves the performance of your Java applications at run time.

The diagram below illustrates the different JVM layers and interaction. It describes the following high level process:

  1. Java source files are compiled by the Java compiler into platform independent bytecode or Java class files.
  2. After your fire your Java application, the JVM loads the compiled classes at run time and execute the proper computation semantic via the Java interpreter.
  3. When JIT is enabled, the JVM will analyze the Java application method calls and compile the bytecode (after some internal thresholds are reached) into native, more efficient, machine code. The JIT process is normally prioritized by the busiest method calls first.
  4. Once such method call is compiled into machine code, the JVM executes it directly instead of “interpreting” it.
  5. The above process leads to improved run time performance over time.


Case study

Now here is the background on the project I was referring to earlier. The primary goal was to add a new IBM P7 AIX virtual server (LPAR) to the production environment in order to improve the platform’s capacity. Find below the specifications of the platform itself:

  • Java EE server: IBM WAS 6.1.0.37 & IBM WCC 7.0.1
  • OS: AIX 6.1
  • JDK: IBM J2RE 1.5.0 (SR12 FP3 +IZ94331) @64-bit
  • RDBMS: Oracle 10g
  • Platform type: Middle tier and batch processing
In order to achieve the existing application performance levels, the exact same hardware specifications were purchased. The AIX OS version and other IBM software’s were also installed using the same version as per existing production.

The following items (check list) were all verified in order to guarantee the same performance level of the application:

  • Hardware specifications (# CPU cores, physical RAM, SAN…).
  • OS version and patch level; including AIX kernel parameters.
  • IBM WAS & IBM WCC version, patch level; including tuning parameters.
  • IBM JRE version, patch level and tuning parameters (start-up arguments, Java heap size…).
  • The network connectivity and performance were also assessed properly.
After the new production server build was completed, functional testing was performed which did also confirm a proper behaviour of the online and batch applications.

However, a major performance problem was detected on the very first day of its production operation. You will find below a summary matrix of the performance problems observed.

Production server
Operation elapsed time
Volume processed (# orders)
CPU %
(average)
Middleware health
Existing server
10 hours
250 000 (baseline)
20%
healthy
*New* server
10 hours
50 000   -500%
80%  +400%
High thread utilization

As you can see from the above view, the performance results were quite disastrous on the first production day. Not only much less orders were processed by the new production server but the physical resource utilization such as CPU % was much higher compared with the existing production servers.

The situation was quite puzzling given the amount of time spent ensuring that the new server was built exactly like the existing ones. At that point, another core team was engaged in order to perform extra troubleshooting and identify the source of the performance problem.

Troubleshooting: searching for the culprit...

The troubleshooting team was split in 2 in order to focus on the items below:

  • Identify the source of CPU % from the IBM WAS container and compare the CPU footprint with the existing production server.
  • Perform more data and file compares between the existing and new production server.
In order to understand the source of the CPU %, we did perform an AIX CPU per Thread analysis from the IBM JVM running IBM WAS and IBM WCC. As you can see from the screenshot below, many threads were found using between 5-20% each. The same analysis performed on the existing production server did reveal fewer # of threads with CPU footprint always around 5%.



Conclusion: the same type of business process was using 3-4 times more CPU vs. the existing production server.

In order to understand the type of processing performed, JVM thread dumps were captured at the same time of the CPU per Thread data. Now the first thing that we realized after reviewing the JVM thread dump (Java core) is that JIT was indeed disabled! The problem was also confirmed by running the java –version command from the running JVM processes.


This finding was quite major, especially given that JIT was enabled on the existing production servers. Around the same time, the other team responsible of comparing the servers did finally find differences between the environment variables of the AIX user used to start the application. Such compare exercise was missed from the earlier gap analysis. What they found is that the new AIX production server had the following extra entry:

JAVA_COMPILER=NONE

As per the IBM documentation, adding such environment variable is one of the ways to disable JIT.

Complex root cause analysis, simple solution

In order to understand the impact of disabling JIT in our environment, you have to understand its implication. Disabling JIT essentially means that the entire JVM is now running in interpretation mode. For our application, running in full interpretation mode not only reduces the application throughput significantly but also increases the pressure point on the server CPU utilization since each request/thread takes 3-4 more times CPU than a request executed with JIT (remember, when JIT is enabled, the JVM will perform many calls to the machine/native code directly).

As expected, the removal of this environment variable along with the restart of the affected JVM processes did resolve the problem and restore the performnance level.

Assessing the JIT benefits for your application

I hope you appreciated this case study and short revisit of the JVM JIT compilation process. In order to understand the impact of not using JIT for your Java application, I recommend that you preform the following experiment:

  • Generate load to your application with JIT enabled and capture some baseline data such as CPU %, response time, # requests etc.
  • Disable JIT.
  • Redo the same testing and compare the results.
I’m looking forward for your comments and please share any experience you may have with JIT.