Skip to main content
  1. Posts/

A Peculiar Bug Hunt: When Exceptions Lose Their Voice

·1195 words·6 mins
NeatGuyCoding
Author
NeatGuyCoding

A Peculiar Bug Hunt: When Exceptions Lose Their Voice
#

Recently, our team has been working with a third-party SDK for some development tasks. Everything was running smoothly until production suddenly started throwing errors, and we encountered something really bizarre. One of my teammates came to me saying, “The code just stops executing halfway through and skips an entire section!”

The code in question was surprisingly straightforward:

try {
    log.info("initiate client with conf: {}", conf);
    SDKClient client = new SDKClient(conf);
    client.init();
    log.info("client initiated");
} catch (Exception e) {
    log.error("initiate client failed", e);
}
log.info("start to manipulate...");

We discovered that the client wasn’t actually initializing successfully, causing all subsequent business logic to fail. When we checked the logs, here’s what we found:

initiate client with conf: xxxxx
start to manipulate...

This was exactly what my teammate meant by “code jumping around.” We never saw either client initiated or initiate client failed in the logs - it just went straight to start to manipulate...

Detective Work with Arthas
#

For those familiar with our setup, we run on k8s + Docker with Arthas built into every image, using Java 16 with JFR enabled. Our logs include trace information and get pulled to a centralized logging server via ELK Agent.

Since the SDK requires IP whitelisting for remote access, we couldn’t directly test against their production environment locally. Our local tests against their staging environment worked perfectly fine. So we had to rely on Arthas for our investigation.

First, let’s verify that the production code matches what we see locally using the jad command:

jad fully.qualified.class.name

After decompiling, we confirmed the code was identical to our source.

Next, let’s observe the actual execution flow:

trace fully.qualified.class.name method

After re-executing the method and examining the trace, we found that an exception was indeed being thrown during initialization:

# Omitting irrelevant details

    +---[min=0.010174ms,max=0.01184ms,total=0.022014ms,count=2] org.apache.logging.log4j.Logger:info() #130
    +---[min=599.388978ms,max=630.23967ms,total=1229.628648ms,count=2] com.dasha13.sdk.SDKClient:<init>() #131
    +---[min=203.617545ms,max=221.785512ms,total=425.403057ms,count=2] com.dasha13.sdk.SDKClient:init() #132 [throws Exception,2]
    +---[min=0.034798ms,max=0.084505ms,total=0.119303ms,count=2] org.apache.logging.log4j.Logger:error() #136
    +---[min=0.010174ms,max=0.01184ms,total=0.022014ms,count=2] org.apache.logging.log4j.Logger:info() #138

But here’s the mystery: why wasn’t this exception being logged? Let’s dig deeper using the watch command with a depth of 2 to capture both the stack trace and message:

watch com.dasha13.sdk.SDKClient init {throwExp} -x 2

However, we only got what appeared to be a message-like output:

method=com.dasha13.sdk.SDKClient init location=AtExceptionExit
ts=2021-08-10 02:58:15; [cost=131.20209ms] result=ERROR DATA!!! object class: class java.util.ArrayList, exception class: class com.google.common.util.concurrent.UncheckedExecutionException, exception message: java.lang.IllegalArgumentException

This was strange. Normally, with depth set to 2, if an exception is thrown, the output should include both the exception message and stack trace. What’s going on here? Let’s try getting the stack trace and message separately.

First, the stack trace:

watch com.dasha13.sdk.SDKClient init {throwExp.getStackTrace()} -x 2

Re-executing the problematic method, the stack trace output normally. Looking at it, the issue seems related to Google’s dependency injection framework Guice (similar to Spring) failing to load a bean:

ts=2021-08-10 03:03:37; [cost=146.644563ms] result=@ArrayList[
    @StackTraceElement[][
        @StackTraceElement[com.google.inject.internal.InjectorImpl$2.get(InjectorImpl.java:1025)],
        @StackTraceElement[com.google.inject.internal.InjectorImpl.getInstance(InjectorImpl.java:1051)],
        @StackTraceElement[com.dasha13.sdk.SDKClient.init(SDKClient.java:482)],
        # Additional stack trace omitted

Now let’s check the exception message:

watch com.dasha13.sdk.SDKClient init {throwExp.getMessage()} -x 2

Re-executing the method, we found that the watch command failed:

watch failed, condition is: null, express is: {throwExp.getMessage()}, com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException, visit /app/arthas/arthas.log for more details.

Following the suggestion, we checked the arthas log and found this exception stack trace:

2021-08-10 03:07:11 [XNIO-2 task-3] ERROR c.t.a.c.command.express.OgnlExpress -Error during evaluating the expression:
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException                                
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203)                                
        at com.google.common.cache.LocalCache.get(LocalCache.java:3937)                                       
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)                           
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)                                 
        at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4830)    
        at com.google.inject.internal.util.StackTraceElements.forMember(StackTraceElements.java:66)                     
        at com.google.inject.internal.Errors.formatSource(Errors.java:806)                                          
        at com.google.inject.internal.Errors.formatSource(Errors.java:785)                                               
        at com.google.inject.internal.Errors.formatInjectionPoint(Errors.java:839)                                                 
        at com.google.inject.internal.Errors.formatSource(Errors.java:800)                           
        at com.google.inject.internal.Errors.formatSource(Errors.java:785)                                    
        at com.google.inject.internal.Errors.format(Errors.java:584)                                            
        at com.google.inject.ProvisionException.getMessage(ProvisionException.java:60)       
cause by: MethodNotFoundException: Method not found: class com.google.common.xxxxxxxxx 

Aha! We discovered that the ProvisionException’s getMessage() method was throwing an exception - an exception’s getMessage() was causing an exception! Looking at the root cause, we identified a version incompatibility between Guava and Guice. The fundamental issue was that the third-party interface was timing out, causing initialization failures. These exceptions were wrapped in ProvisionException, but since ProvisionException’s getMessage relies on Guava Cache to cache exception information, and our project’s Guava version was incompatible with the Guice version, certain methods didn’t exist. This caused ProvisionException’s getMessage to also throw exceptions. Previously, this worked fine because the third-party interface hadn’t experienced timeouts during initialization.

Why Exception Information Wasn’t Being Logged
#

We use log4j2’s asynchronous logging configuration, passing exceptions as the last parameter to logging methods. Under normal circumstances, this would output the exception’s message and stack trace. But as we discovered above, getting the message itself was throwing an exception. Log4j uses a producer-consumer architecture for log events. Here, the consumer tries to get the exception’s message and stack trace, but encounters an exception while getting the message. For Log4j2 async logging, when an exception occurs, the original log event gets discarded entirely, and the exception is output to the StatusLogger (which essentially goes to standard error output). This corresponds to the log4j source code:

AppenderControl

private void tryCallAppender(final LogEvent event) {
    try {
        //Call appender to output log
        appender.append(event);
    } catch (final RuntimeException error) {
        //Handle RuntimeException
        handleAppenderError(event, error);
    } catch (final Exception error) {
        //Handle other Exceptions
        handleAppenderError(event, new AppenderLoggingException(error));
    }
}

private void handleAppenderError(final LogEvent event, final RuntimeException ex) {
    appender.getHandler().error(createErrorMsg("An exception occurred processing Appender "), event, ex);
    if (!appender.ignoreExceptions()) {
        throw ex;
    }
}

The ErrorHandler is typically the default implementation, DefaultErrorHandler, which outputs to a StatusLogger:

DefaultErrorHandler

private static final Logger LOGGER = StatusLogger.getLogger();
public void error(final String msg, final LogEvent event, final Throwable t) {
    final long current = System.nanoTime();
    if (current - lastException > EXCEPTION_INTERVAL || exceptionCount++ < MAX_EXCEPTIONS) {
        LOGGER.error(msg, t);
    }
    lastException = current;
    if (!appender.ignoreExceptions() && t != null && !(t instanceof AppenderLoggingException)) {
        throw new AppenderLoggingException(msg, t);
    }
}

StatusLogger essentially outputs to standard error System.err:

StatusLogger

this.logger = new SimpleLogger("StatusLogger", Level.ERROR, false, true, showDateTime, false,
                dateFormat, messageFactory, PROPS, 
                //Standard error output
                System.err);

In our deployment architecture, standard error output goes to a rather obscure location that basically nobody checks, so we missed it. When we finally looked at the standard error output, we indeed found the exception:

2021-08-10 03:30:29,810 Log4j2-TF-10-AsyncLoggerConfig-3 ERROR An exception occurred processing Appender file com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2203)
	at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
	at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4830)
	at com.google.inject.internal.util.StackTraceElements.forMember(StackTraceElements.java:66)
	at com.google.inject.internal.Errors.formatSource(Errors.java:806)
	at com.google.inject.internal.Errors.formatSource(Errors.java:785)
	at com.google.inject.internal.Errors.formatInjectionPoint(Errors.java:839)
	at com.google.inject.internal.Errors.formatSource(Errors.java:800)
	at com.google.inject.internal.Errors.formatSource(Errors.java:785)
	at com.google.inject.internal.Errors.format(Errors.java:584)
	at com.google.inject.ProvisionException.getMessage(ProvisionException.java:60)
	at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:105)
	at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:93)
	at org.apache.logging.log4j.core.impl.Log4jLogEvent.getThrownProxy(Log4jLogEvent.java:629)
	at org.apache.logging.log4j.core.pattern.ExtendedThrowablePatternConverter.format(ExtendedThrowablePatternConverter.java:63)
	at org.springframework.boot.logging.log4j2.ExtendedWhitespaceThrowablePatternConverter.format(ExtendedWhitespaceThrowablePatternConverter.java:50)
	at org.apache.logging.log4j.core.pattern.PatternFormatter.format(PatternFormatter.java:38)
	at org.apache.logging.log4j.core.layout.PatternLayout$PatternSerializer.toSerializable(PatternLayout.java:345)
	at org.apache.logging.log4j.core.layout.PatternLayout.toText(PatternLayout.java:244)
	at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:229)
	at org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:59)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:197)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:190)
	at org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:181)
	at org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:312)
	at org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
	at org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
	at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:543)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.callAppenders(AsyncLoggerConfig.java:127)
	at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:502)
	at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:485)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.log(AsyncLoggerConfig.java:121)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfig.logToAsyncLoggerConfigsOnCurrentThread(AsyncLoggerConfig.java:169)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:111)
	at org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor$Log4jEventWrapperHandler.onEvent(AsyncLoggerConfigDisruptor.java:97)
	at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
	at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IllegalArgumentException
	at com.google.inject.internal.asm.$ClassReader.<init>(Unknown Source)
	at com.google.inject.internal.asm.$ClassReader.<init>(Unknown Source)
	at com.google.inject.internal.asm.$ClassReader.<init>(Unknown Source)
	at com.google.inject.internal.util.LineNumbers.<init>(LineNumbers.java:65)
	at com.google.inject.internal.util.StackTraceElements$1.load(StackTraceElements.java:43)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
	... 41 more

Furthermore, based on the Appender’s ignoreExceptions configuration (which defaults to true), it determines whether the calling location should throw an exception. However, this is for synchronous logging. Async logging throws exceptions to Disruptor’s exception handler, and Log4j2’s Disruptor exception handling also outputs exceptions to System.err (standard error output). By default, exceptions aren’t thrown because for synchronous logging, nobody wants business logic to fail just because of logging issues. For async logging, since the previous processing already outputs to standard error, there’s no need for redundancy.

Related

A Hidden Production Issue Discovered Through SQL Optimization
·1101 words·6 mins
When our operations team brought us a complex SQL query that was taking forever to execute, we thought it was just a performance issue. Little did we know, this investigation would uncover a deeply hidden character encoding mismatch that had been silently causing full table scans in our production database.