A Peculiar Bug Hunt: When Exceptions Lose Their Voice#
Recently, our team has been working with a third-party SDK for some development tasks. Everything was running smoothly until production suddenly started throwing errors, and we encountered something really bizarre. One of my teammates came to me saying, “The code just stops executing halfway through and skips an entire section!”
The code in question was surprisingly straightforward:
try {
log.info("initiate client with conf: {}", conf);
SDKClient client = new SDKClient(conf);
client.init();
log.info("client initiated");
} catch (Exception e) {
log.error("initiate client failed", e);
}
log.info("start to manipulate...");
We discovered that the client wasn’t actually initializing successfully, causing all subsequent business logic to fail. When we checked the logs, here’s what we found:
initiate client with conf: xxxxx
start to manipulate...
This was exactly what my teammate meant by “code jumping around.” We never saw either client initiated
or initiate client failed
in the logs - it just went straight to start to manipulate...
Detective Work with Arthas#
For those familiar with our setup, we run on k8s + Docker with Arthas built into every image, using Java 16 with JFR enabled. Our logs include trace information and get pulled to a centralized logging server via ELK Agent.
Since the SDK requires IP whitelisting for remote access, we couldn’t directly test against their production environment locally. Our local tests against their staging environment worked perfectly fine. So we had to rely on Arthas for our investigation.
First, let’s verify that the production code matches what we see locally using the jad
command:
jad fully.qualified.class.name
After decompiling, we confirmed the code was identical to our source.
Next, let’s observe the actual execution flow:
trace fully.qualified.class.name method
After re-executing the method and examining the trace, we found that an exception was indeed being thrown during initialization:
# Omitting irrelevant details
+---[min=0.010174ms,max=0.01184ms,total=0.022014ms,count=2] org.apache.logging.log4j.Logger:info() #130
+---[min=599.388978ms,max=630.23967ms,total=1229.628648ms,count=2] com.dasha13.sdk.SDKClient:<init>() #131
+---[min=203.617545ms,max=221.785512ms,total=425.403057ms,count=2] com.dasha13.sdk.SDKClient:init() #132 [throws Exception,2]
+---[min=0.034798ms,max=0.084505ms,total=0.119303ms,count=2] org.apache.logging.log4j.Logger:error() #136
+---[min=0.010174ms,max=0.01184ms,total=0.022014ms,count=2] org.apache.logging.log4j.Logger:info() #138
But here’s the mystery: why wasn’t this exception being logged? Let’s dig deeper using the watch
command with a depth of 2 to capture both the stack trace and message:
watch com.dasha13.sdk.SDKClient init {throwExp} -x 2
However, we only got what appeared to be a message-like output:
method=com.dasha13.sdk.SDKClient init location=AtExceptionExit
ts=2021-08-10 02:58:15; [cost=131.20209ms] result=ERROR DATA!!! object class: class java.util.ArrayList, exception class: class com.google.common.util.concurrent.UncheckedExecutionException, exception message: java.lang.IllegalArgumentException
This was strange. Normally, with depth set to 2, if an exception is thrown, the output should include both the exception message and stack trace. What’s going on here? Let’s try getting the stack trace and message separately.
First, the stack trace:
watch com.dasha13.sdk.SDKClient init {throwExp.getStackTrace()} -x 2
Re-executing the problematic method, the stack trace output normally. Looking at it, the issue seems related to Google’s dependency injection framework Guice (similar to Spring) failing to load a bean:
ts=2021-08-10 03:03:37; [cost=146.644563ms] result=@ArrayList[
@StackTraceElement[][
@StackTraceElement[com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1025)],
@StackTraceElement[com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1051)],
@StackTraceElement[com.dasha13.sdk.SDKClient.init(SDKClient.java:482)],
# Additional stack trace omitted
Now let’s check the exception message:
watch com.dasha13.sdk.SDKClient init {throwExp.getMessage()} -x 2
Re-executing the method, we found that the watch command failed:
watch failed, condition is: null, express is: {throwExp.getMessage()}, com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException, visit /app/arthas/arthas.log for more details.
Following the suggestion, we checked the arthas log and found this exception stack trace:
2021-08-10 03:07:11 [XNIO-2 task-3] ERROR c.t.a.c.command.express.OgnlExpress -Error during evaluating the expression:
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203)
at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4830)
at com.google.inject.internal.util.StackTraceElements.forMember(StackTraceElements.java:66)
at com.google.inject.internal.Errors.formatSource(Errors.java:806)
at com.google.inject.internal.Errors.formatSource(Errors.java:785)
at com.google.inject.internal.Errors.formatInjectionPoint(Errors.java:839)
at com.google.inject.internal.Errors.formatSource(Errors.java:800)
at com.google.inject.internal.Errors.formatSource(Errors.java:785)
at com.google.inject.internal.Errors.format(Errors.java:584)
at com.google.inject.ProvisionException.getMessage(ProvisionException.java:60)
cause by: MethodNotFoundException: Method not found: class com.google.common.xxxxxxxxx
Aha! We discovered that the ProvisionException
’s getMessage()
method was throwing an exception - an exception’s getMessage()
was causing an exception! Looking at the root cause, we identified a version incompatibility between Guava and Guice. The fundamental issue was that the third-party interface was timing out, causing initialization failures. These exceptions were wrapped in ProvisionException
, but since ProvisionException
’s getMessage
relies on Guava Cache to cache exception information, and our project’s Guava version was incompatible with the Guice version, certain methods didn’t exist. This caused ProvisionException
’s getMessage
to also throw exceptions. Previously, this worked fine because the third-party interface hadn’t experienced timeouts during initialization.
Why Exception Information Wasn’t Being Logged#
We use log4j2’s asynchronous logging configuration, passing exceptions as the last parameter to logging methods. Under normal circumstances, this would output the exception’s message and stack trace. But as we discovered above, getting the message itself was throwing an exception. Log4j uses a producer-consumer architecture for log events. Here, the consumer tries to get the exception’s message and stack trace, but encounters an exception while getting the message. For Log4j2 async logging, when an exception occurs, the original log event gets discarded entirely, and the exception is output to the StatusLogger (which essentially goes to standard error output). This corresponds to the log4j source code:
private void tryCallAppender(final LogEvent event) {
try {
//Call appender to output log
appender.append(event);
} catch (final RuntimeException error) {
//Handle RuntimeException
handleAppenderError(event, error);
} catch (final Exception error) {
//Handle other Exceptions
handleAppenderError(event, new AppenderLoggingException(error));
}
}
private void handleAppenderError(final LogEvent event, final RuntimeException ex) {
appender.getHandler().error(createErrorMsg("An exception occurred processing Appender "), event, ex);
if (!appender.ignoreExceptions()) {
throw ex;
}
}
The ErrorHandler is typically the default implementation, DefaultErrorHandler, which outputs to a StatusLogger:
private static final Logger LOGGER = StatusLogger.getLogger();
public void error(final String msg, final LogEvent event, final Throwable t) {
final long current = System.nanoTime();
if (current - lastException > EXCEPTION_INTERVAL || exceptionCount++ < MAX_EXCEPTIONS) {
LOGGER.error(msg, t);
}
lastException = current;
if (!appender.ignoreExceptions() && t != null && !(t instanceof AppenderLoggingException)) {
throw new AppenderLoggingException(msg, t);
}
}
StatusLogger essentially outputs to standard error System.err
:
this.logger = new SimpleLogger("StatusLogger", Level.ERROR, false, true, showDateTime, false,
dateFormat, messageFactory, PROPS,
//Standard error output
System.err);
In our deployment architecture, standard error output goes to a rather obscure location that basically nobody checks, so we missed it. When we finally looked at the standard error output, we indeed found the exception:
2021-08-10 03:30:29,810 Log4j2-TF-10-AsyncLoggerConfig-3 ERROR An exception occurred processing Appender file com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203)
at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)
at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4830)
at com.google.inject.internal.util.StackTraceElements.forMember(StackTraceElements.java:66)
at com.google.inject.internal.Errors.formatSource(Errors.java:806)
at com.google.inject.internal.Errors.formatSource(Errors.java:785)
at com.google.inject.internal.Errors.formatInjectionPoint(Errors.java:839)
at com.google.inject.internal.Errors.formatSource(Errors.java:800)
at com.google.inject.internal.Errors.formatSource(Errors.java:785)
at com.google.inject.internal.Errors.format(Errors.java:584)
at com.google.inject.ProvisionException.getMessage(ProvisionException.java:60)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:105)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:93)
at org.apache.logging.log4j.core.impl.Log4jLogEvent.getThrownProxy(Log4jLogEvent.java:629)
at org.apache.logging.log4j.core.pattern.ExtendedThrowablePatternConverter.format(ExtendedThrowablePatternConverter.java:63)
at org.springframework.boot.logging.log4j2.ExtendedWhitespaceThrowablePatternConverter.format(ExtendedWhitespaceThrowablePatternConverter.java:50)
at org.apache.logging.log4j.core.pattern.PatternFormatter.format(PatternFormatter.java:38)
at org.apache.logging.log4j.core.layout.PatternLayout$PatternSerializer.toSerializable(PatternLayout.java:345)
at org.apache.logging.log4j.core.layout.PatternLayout.toText(PatternLayout.java:244)
at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:229)
at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:59)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:197)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:190)
at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:181)
at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:312)
at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:543)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.callAppenders(AsyncLoggerConfig.java:127)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:502)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:485)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:121)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncLoggerConfigsOnCurrentThread(AsyncLoggerConfig.java:169)
at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:111)
at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:97)
at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IllegalArgumentException
at com.google.inject.internal.asm.$ClassReader.<init>(Unknown Source)
at com.google.inject.internal.asm.$ClassReader.<init>(Unknown Source)
at com.google.inject.internal.asm.$ClassReader.<init>(Unknown Source)
at com.google.inject.internal.util.LineNumbers.<init>(LineNumbers.java:65)
at com.google.inject.internal.util.StackTraceElements$1.load(StackTraceElements.java:43)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
... 41 more
Furthermore, based on the Appender’s ignoreExceptions
configuration (which defaults to true), it determines whether the calling location should throw an exception. However, this is for synchronous logging. Async logging throws exceptions to Disruptor’s exception handler, and Log4j2’s Disruptor exception handling also outputs exceptions to System.err
(standard error output). By default, exceptions aren’t thrown because for synchronous logging, nobody wants business logic to fail just because of logging issues. For async logging, since the previous processing already outputs to standard error, there’s no need for redundancy.