
Tackling a Mysterious JVM Safepoint Issue: A Journey from Problem to Solution

NeatGuyCoding


Recently, we encountered a peculiar issue in our production environment: certain high-volume logging applications would freeze for extended periods whenever the hourly log synchronization task kicked off at five minutes past each hour. We diagnosed the incidents through careful examination of our safepoint logs, enabled via the startup parameter -Xlog:safepoint=trace:file=${LOG_PATH}/safepoint%t.log:utctime,level,tags:filecount=10,filesize=10M (on Java 17, -Xlog:async can additionally be used for asynchronous logging so that JVM log output does not block the process). We discovered that all of the incidents were caused by the following logging configuration:

  1. GC logs: -Xlog:gc*=debug:file=${LOG_PATH}/gc%t.log:utctime,level,tags:filecount=50,filesize=100M
  2. JIT compilation logs: -Xlog:jit+compilation=info:file=${LOG_PATH}/jit_compile%t.log:utctime,level,tags:filecount=10,filesize=10M
  3. Disabled fast-throw stack trace omission: -XX:-OmitStackTraceInFastThrow (this optimization only omits stack traces for JVM-internal implicit exceptions such as NullPointerException). We had already optimized elsewhere to handle the performance pressure of excessive stack trace output during high error volumes.
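
Put together with the safepoint logging above, the JVM logging portion of our startup parameters therefore looked roughly like this (paths, file counts, and sizes exactly as listed above):

```
-Xlog:safepoint=trace:file=${LOG_PATH}/safepoint%t.log:utctime,level,tags:filecount=10,filesize=10M
-Xlog:gc*=debug:file=${LOG_PATH}/gc%t.log:utctime,level,tags:filecount=50,filesize=100M
-Xlog:jit+compilation=info:file=${LOG_PATH}/jit_compile%t.log:utctime,level,tags:filecount=10,filesize=10M
-XX:-OmitStackTraceInFastThrow
```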

After implementing these configurations, our application exhibited a strange problem with three distinct manifestations, all sharing one common characteristic: exceptionally long safepoint durations.

1. Safepoint logs revealed extremely long wait times for all threads to reach the safepoint (Reaching safepoint: over 25 seconds)

image

2. Safepoint logs showed extended safepoint durations caused by GC (At safepoint: over 37 seconds)

image

Examining the GC logs, we noticed a significant delay between the "Heap before GC invocations" line and the heap structure output that follows it:

image

3. Another case of prolonged safepoint duration due to GC (over 29 seconds), but with a different log interval pattern. The GC logs showed substantial delays before certain heap structure output lines:

image

Problem Investigation

When Java application threads are all stopped at a safepoint, they are completely inactive. This means that any monitoring that depends on application threads, such as external JVM monitoring through Spring Actuator's Prometheus endpoints or SkyWalking instrumentation, is blind to the actual issue. These tools only show that methods called around the safepoint take exceptionally long, which does not mean those methods themselves are the bottleneck.

We needed to rely on the JVM's internal monitoring mechanisms, such as JVM logs and JFR (Java Flight Recorder), for a proper diagnosis. We also used async-profiler (https://github.com/jvm-profiling-tools/async-profiler/), because we observed that during the problematic periods the process's own CPU usage (not the machine's, but specifically this process) would spike dramatically:

image
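
For reference, the kind of capture this involves looks roughly like the following; the exact options depend on the JDK and async-profiler versions installed, so treat this as a sketch rather than our exact commands:

```
# Start a time-bounded JFR recording on the affected process
jcmd <pid> JFR.start name=safepoint-investigation settings=profile duration=120s filename=/tmp/safepoint-investigation.jfr

# Sample CPU usage of the same process with async-profiler for 60 seconds
./profiler.sh -e cpu -d 60 -f /tmp/cpu-flamegraph.html <pid>
```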

Curiously, when we examined CPU usage with async_profiler during the problematic timeframes, we noticed something beyond the expected patterns: during safepoint periods, logging appeared to be interrupted, which is extremely rare. Here is why this observation is significant:

For the first phenomenon (extended wait times for all threads to enter safepoint), we would typically expect continuous output of logs indicating which threads haven’t entered the safepoint, as referenced in the JVM source code:

https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/safepoint.cpp

image

However, in our first phenomenon, we didn’t observe any logs indicating which specific thread was causing the extended safepoint entry time.

For the second phenomenon, JFR analysis didn't reveal any particularly time-consuming GC phases.

For the third phenomenon, examining the JVM source code revealed that virtually nothing happens between the two log lines with the large interval except logging itself. Additionally, all of the problematic timestamps fell around five minutes past the hour. Upon consulting our operations team, we learned that this timing coincides with removing the previous hour's log file and synchronizing it to EFS (we generate one log file per hour), which involves substantial file I/O (since we're on cloud services, this may be EFS-style NFS or network object storage rather than traditional local disk I/O). Could excessive file I/O be blocking JVM log output and consequently freezing the entire JVM?

Why would JVM log output block all application threads? If a thread gets stuck writing JVM logs outside of a safepoint, it only blocks itself, not all application threads. Around a safepoint, however, the situation is different: a stuck log write can either keep a thread from ever reaching the safepoint, or it can stretch out the safepoint itself while every application thread is suspended. For the first phenomenon, a thread writing JVM logs (not application logs; application log output goes through ordinary native file I/O calls, which count as being at a safepoint and therefore don't cause this problem, as detailed in my other article, JVM Related - SafePoint and Stop The World Complete Guide: https://zhuanlan.zhihu.com/p/161710652) gets stuck and thus never enters the safepoint. For phenomena two and three, GC threads get stuck writing JVM logs, preventing the GC, and hence the safepoint, from completing.

Let’s first examine the JVM source code to confirm whether stuck JVM log output can indeed block the JVM.

JVM Log Output Source Code Analysis

We're using Java 17, which introduced asynchronous JVM log output (unavailable in earlier versions). For the source code analysis, let's first ignore the asynchronous path and look at the default synchronous output, which is also how logging behaved before Java 17:

https://github.com/openjdk/jdk/blob/master/src/hotspot/share/logging/logFileStreamOutput.cpp

image

This code clearly shows that if the underlying file I/O gets stuck, the flush operation blocks. It also explains the brief CPU spikes: the flushing wait strategy likely spins on the CPU for a while before entering a blocked state.
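
To make the "spin first, then block" idea concrete, here is a minimal Java sketch of that general wait pattern. It is an illustration of the pattern only, not the HotSpot flush code:

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

/**
 * Rough illustration of a spin-then-block wait strategy (not HotSpot code):
 * the caller burns CPU for a bounded number of spins hoping the condition
 * clears quickly, then falls back to parking so it stops consuming CPU.
 */
public final class SpinThenBlockWait {

    private static final int MAX_SPINS = 10_000;

    public static void awaitCompletion(BooleanSupplier ioCompleted) {
        int spins = 0;
        while (!ioCompleted.getAsBoolean()) {
            if (spins++ < MAX_SPINS) {
                Thread.onSpinWait();               // busy-wait: shows up as a short CPU spike
            } else {
                LockSupport.parkNanos(1_000_000L); // back off: the thread is now simply blocked
            }
        }
    }
}
```

If the I/O never completes, the caller first appears as a CPU spike and then as a blocked thread, which matches what we observed.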

What about switching to asynchronous logging? What parameters are available? JVM asynchronous logging was introduced in Java 17, corresponding to issue: https://bugs.openjdk.org/browse/JDK-8229517, with these key parameters:

image

Enable JVM asynchronous logging with -Xlog:async and specify the asynchronous log buffer size with -XX:AsyncLogBufferSize= (default is 2097152, or 2MB). The asynchronous logging principle works as follows:

image
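
As a rough mental model of the principle shown above (a minimal sketch only, not the HotSpot implementation): producer threads append log messages to a bounded in-memory buffer and never touch the file, while a dedicated thread drains the buffer and performs the actual writes; when the buffer is full, messages are dropped and counted rather than blocking the caller.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Minimal sketch of the asynchronous-logging principle; not the HotSpot implementation. */
public class AsyncLogSketch {

    private final BlockingQueue<String> buffer;
    private final AtomicLong dropped = new AtomicLong();

    public AsyncLogSketch(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
        Thread flusher = new Thread(this::flushLoop, "log-flusher");
        flusher.setDaemon(true);
        flusher.start();
    }

    /** Called by application/VM threads: enqueue only, never touch the file. */
    public void log(String message) {
        if (!buffer.offer(message)) {
            dropped.incrementAndGet(); // buffer full: drop and count rather than block the caller
        }
    }

    private void flushLoop() {
        try {
            while (true) {
                String message = buffer.take(); // wait for work
                writeToFile(message);           // only this thread can stall on slow file I/O
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void writeToFile(String message) {
        // Placeholder for the actual file write; a stall here no longer
        // blocks the threads that called log().
        System.out.println(message);
    }
}
```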

Implementing Asynchronous Logging: Significant Improvement with Remaining Issues

We modified our configuration to use asynchronous logging by adding startup parameters: -Xlog:async and -XX:AsyncLogBufferSize=4194304. After observation, the problem was significantly alleviated:

image

However, one instance still experienced the issue, but with different symptoms. The safepoint logs now showed a thread continuously running and refusing to enter the safepoint:

image

What was this thread doing? We used jstack to identify the thread:

image

This was a scheduled thread for refreshing the microservice instance list, and the code wasn’t using WebFlux properly:

image

This kind of improper asynchronous code can mislead the JIT: the correct usage is invoked frequently, and the incorrect usage is invoked frequently as well, so the code keeps being C2-compiled and then deoptimized. JFR analysis indeed revealed numerous JIT deoptimizations during this period:

image

This can cause safepoint polls to be missed, leaving I/O operations spinning and waiting for extended periods. We needed to correct the usage:

image
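
The offending code and its fix appear only in the screenshots above. Purely as a hypothetical illustration of this kind of change (the class name, endpoint, and types below are invented for the example), blocking once per call inside the scheduled refresh loop versus composing the calls and waiting once might look like this:

```java
import java.util.ArrayList;
import java.util.List;

import org.springframework.web.reactive.function.client.WebClient;

import reactor.core.publisher.Flux;

/** Hypothetical example only; the real code appears solely in the screenshots above. */
public class InstanceListRefresher {

    private final WebClient client = WebClient.create("http://registry.example.internal");

    /** Anti-pattern sketch: block() on every single call inside the scheduled refresh loop. */
    public List<String> refreshBlockingPerCall(List<String> serviceIds) {
        List<String> instances = new ArrayList<>();
        for (String id : serviceIds) {
            // Each iteration parks the scheduled thread on a blocking wait for one response.
            String body = client.get()
                    .uri("/instances/{id}", id)
                    .retrieve()
                    .bodyToMono(String.class)
                    .block();
            instances.add(body);
        }
        return instances;
    }

    /** Corrected sketch: compose the calls reactively and wait once at the edge. */
    public List<String> refreshComposed(List<String> serviceIds) {
        return Flux.fromIterable(serviceIds)
                .flatMap(id -> client.get()
                        .uri("/instances/{id}", id)
                        .retrieve()
                        .bodyToMono(String.class))
                .collectList()
                .block();
    }
}
```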

After implementing the proper approach, the issue of threads refusing to enter safepoints disappeared completely.
