Java Memory Efficiency: Moving Collectors, Heap Knobs, and Profiling First

Author: NeatGuyCoding


A Java process holding several gigabytes of heap in Task Manager is often labeled “bloated.” Shift the lens from absolute gigabytes to CPU cycles spent on memory management, and the picture changes: G1 (the default collector) and ZGC (opt-in) are both moving collectors that deliberately trade more RAM for lower memory-management CPU overhead. In Inside Java Podcast 59, Oracle Java architect Ron Pressler frames this trade-off as “using RAM chips as program accelerators”; GPU acceleration gets praise while RAM acceleration gets called bloated (speaker’s view). The sections below unpack the mechanics by platform-design theme. Quantitative economics and cloud-sizing claims are attributed to their source (speaker’s view versus official documentation); GC algorithm details follow the official JEPs and HotSpot source.

JAVAONE'26 studio two-person interview: background screen shows JAVAONE'26 branding; segment discussing moving collectors and the Appel paper.


Moving garbage collection: trading RAM for CPU

Why

Manual malloc/free, reference counting, and non-moving mark-sweep all pay continuously to “reuse an address”: freeing, coalescing free blocks, fighting fragmentation. Moving collectors take a different path—during compaction they relocate live objects into contiguous regions, and space occupied by dead objects can be overwritten directly on subsequent allocation, so the allocation side advances almost linearly. In his 1987 paper Garbage collection can be faster than stack allocation, Andrew W. Appel argued that under certain conditions GC can be faster than stack allocation; Ron cites this work to show that moving collectors are not an “after-the-fact patch” but an algorithmic choice to reduce memory-management CPU overhead (“RAM as accelerator” is the speaker’s summary, not the paper’s wording).
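
The relocation idea can be illustrated with a toy model. The sketch below is not how G1 or ZGC is implemented: it assumes a heap represented as an `int[]` in which `0` marks a dead slot, and the class name `ToyCompaction` is made up for illustration. Real collectors must also find live objects by tracing and update every reference to a moved object, which is omitted here.

```java
import java.util.Arrays;

public class ToyCompaction {
    /**
     * Slides live slots to the front of the heap array, preserving their order.
     * A slot value of 0 stands for a dead object; any other value is live.
     * Returns the new allocation pointer (index of the first free slot).
     */
    public static int compact(int[] heap) {
        int top = 0;
        for (int i = 0; i < heap.length; i++) {
            if (heap[i] != 0) {
                heap[top++] = heap[i]; // relocate the live object
            }
        }
        Arrays.fill(heap, top, heap.length, 0); // everything past top is free
        return top;
    }

    public static void main(String[] args) {
        int[] heap = {7, 0, 3, 0, 0, 9, 0, 4};
        int top = compact(heap);
        System.out.println(Arrays.toString(heap) + " top=" + top);

        // After compaction, allocation is a single pointer bump.
        heap[top++] = 5;
        System.out.println(Arrays.toString(heap) + " top=" + top);
    }
}
```

Note that the cost of `compact` depends on the number of slots scanned and live objects moved, while dead slots are never freed individually; they are simply overwritten.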

JEP 333 describes ZGC as a compacting collector that reclaims memory through object relocation; JEP 248 made G1 the server default from JDK 9 onward. By contrast, CMS (deprecated by JEP 291 in JDK 9, removed by JEP 363 in JDK 14) is described in JEP 333’s Alternatives section as having no support for compaction, which puts it in a different design space from the moving route. The official reason for removing CMS is the lack of maintainers and the availability of G1, ZGC, and Shenandoah as replacements (speaker’s view: the moving route is overall superior; that is Ron’s summary, not JEP 363’s text).

Mechanism and constraints

| Dimension | Moving (G1 / ZGC) | Non-moving (e.g. CMS, Go mark-sweep) |
| --- | --- | --- |
| RAM | Higher: retains garbage, reserves compaction space | Usually lower |
| Memory-management CPU | Tunable via heap size; tends lower | free/fragmentation costs more direct |
| Ops knobs | -Xmx and other heap bounds ≈ “memory knob” | Speaker’s view: lacks equivalent knobs |

Even with the knob turned toward “save memory as much as possible,” a moving collector may still use more RAM than a non-moving design, but gains higher throughput (speaker’s view). Economically, RAM is usually cheaper than CPU; Ron gives an order-of-magnitude “10× more RAM for 5% CPU” in a JavaOne companion talk (not repeated in this podcast; figure not independently verified).

G1 and ZGC should not be conflated: JEP 189 states G1 performs evacuation but does not do concurrent evacuation; ZGC’s concurrent relocation is a different tier of implementation. On JDK 27, JEP 523 proposes making G1 the default GC for all environments (Integrated), continuing the moving main path.


How to
#

Selection and migration: -XX:+UseConcMarkSweepGC is ignored on JDK 14+ and falls back to the default GC (JEP 363). Prefer G1 for new deployments; evaluate ZGC for low-latency scenarios (JDK 15 GA, JEP 377). No API-level breaking change—the impact is on the operations side.

# View the selected GC (JDK 9+); quote the pattern so the shell does not expand it
java -XX:+PrintFlagsFinal -version 2>&1 | grep 'Use.*GC'

# Explicitly enable ZGC (example)
java -XX:+UseZGC -Xmx4g -jar app.jar
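
The active collector can also be checked from inside the process. The following is a minimal sketch using the standard `java.lang.management` API; the class name `ActiveCollectors` is made up, and the bean names it prints vary by collector and JDK version, so treat them as informational rather than as a stable contract.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class ActiveCollectors {
    /** Names of the collector beans the running JVM exposes. */
    public static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints one line per collector bean with its cumulative counters.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

Running it once with the default settings and once with `-XX:+UseZGC` is a quick way to confirm that a flag change actually took effect.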

Common misconceptions

  • Treating “Java uses memory” as an implementation oversight—in the moving model, some RAM is a deliberate trade-off for CPU.
  • Equating G1 and ZGC concurrent-moving capabilities—evacuation timing and pause characteristics differ.
  • Using JEP 363’s text to prove “moving is necessarily better”—the official wording is about maintainability and alternatives; overall superiority is the architect’s summary.

Wide shot: JAVAONE'26 backdrop and ORACLE bookshelf; two people seated discussing moving GC and RAM/CPU trade-offs.


Whole-machine view: RAM-to-CPU ratio

Why

Backend sizing that looks only at “how many GB this pod has” invites misjudgment. Ron defines the key metric as how much RAM per CPU core (speaker’s view), inspired by a keynote from Oracle GC team member Erik Österlund at the SIGPLAN ISMM workshop (specific keynote claims not retrieved for full verification; Österlund’s identity appears under Reviewed by in JEP 377). The intuition: if a process persistently saturates a core’s CPU, almost no other program can use that core’s RAM effectively—“100% CPU + 1% RAM” earns no bonus points (speaker’s view).

Kubernetes Pod resource management describes CPU and memory via request/limit; there is no global hard rule of “at least 1 GB per core.” Ron says the smallest cloud instances commonly offer ≥ 1 GB RAM / core (speaker’s view; cloud vendor SKUs change over time—verify yourself). When Nicolai asks whether Java loses on “few-GB” small instances, Ron says no and stresses RAM/core over absolute GB (specific benchmarks are in the JavaOne talk; the podcast does not provide them).

Mechanism and constraints

Bottleneck-shift logic: when CPU is scarce, using idle RAM to ease GC and allocation pressure aligns with general performance engineering (consensus discussion). Historically, Java was desktop-heavy and users watched RAM in Task Manager; servers care more about throughput, latency, and cost curves (speaker’s view).

How to
#

On K8s or VMs, declare both CPU and memory requests; avoid squeezing only the memory limit while the CPU is already saturated:

resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    memory: "4Gi"

When pairing with -Xmx, keep the heap cap below the container memory limit to reserve headroom for metaspace, thread stacks, off-heap memory, and GC—the ratio depends on workload; there is no single official formula.
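
To see the ratio the JVM itself ends up with, a small probe can help. This is a sketch under one important assumption: it reports maximum heap per available processor, which is only a proxy for the RAM-per-core figure discussed above, because non-heap memory is not included. The class name `RamPerCore` is made up for illustration.

```java
public class RamPerCore {
    /** Bytes of maximum heap per available processor. */
    public static double heapBytesPerCore(long maxHeapBytes, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return (double) maxHeapBytes / cores;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long maxHeap = rt.maxMemory();        // effective -Xmx or the ergonomic default
        int cores = rt.availableProcessors(); // processors visible to this JVM
        double gibPerCore = heapBytesPerCore(maxHeap, cores) / (1024.0 * 1024 * 1024);
        System.out.printf("max heap = %d MiB, cores = %d, heap per core = %.2f GiB%n",
                maxHeap / (1024 * 1024), cores, gibPerCore);
    }
}
```

Running this inside the container, with the same flags as the application, shows what the limits in the manifest translate to at runtime.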

Common misconceptions
#

  • Comparing Java to Go/Rust by absolute GB alone—ignoring RAM/core and whether the workload is compute-bound.
  • Treating Ron’s 1 GB/core as a Kubernetes specification—Kubernetes docs have no such global floor.
  • Using https://www.ismm.org/ as the memory-management ISMM—that domain points to a mountain-medicine society; the correct entry is SIGPLAN ISMM.

Ron gesturing in front of the JAVAONE'26 backdrop: when CPU is saturated, RAM should be used effectively rather than left idle.


Allocation model: TLAB, compaction, and heap knobs

Why

Java heap allocation has no per-object free semantics: threads allocate by advancing a _top pointer inside a ThreadLocalAllocBuffer (TLAB) (inline implementation), approaching linear bump speed; reclamation is batched by the GC during compaction. Ron takes a roughly constant live object count in steady state as a premise: GC work is driven by the live set and only weakly related to garbage volume. The larger the heap, the more can be allocated before it fills, so collection frequency is roughly inversely proportional to heap size (idealized steady-state inference; no corresponding JEP). Double the heap and similar preservation work happens about half as often; with a 10× heap, collections are roughly one-tenth as frequent. The counterexample is a heap that runs at 90–99% occupancy for long periods, which causes thrashing (general GC experience; HotSpot does not encode that percentage).
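
The steady-state argument can be written down as a back-of-the-envelope model. The sketch below assumes a constant live set and a constant allocation rate, and treats a collection as necessary whenever the free space (heap minus live set) has been filled. The numbers and the class name `GcFrequencyModel` are illustrative assumptions, not measurements or HotSpot behavior.

```java
public class GcFrequencyModel {
    /**
     * Idealized steady state: one collection is needed each time the free
     * space (heap minus live set) has been filled by new allocation.
     */
    public static double collectionsPerSecond(double heapBytes, double liveBytes,
                                              double allocBytesPerSecond) {
        double free = heapBytes - liveBytes;
        if (free <= 0) {
            throw new IllegalArgumentException("live set does not fit in heap");
        }
        return allocBytesPerSecond / free;
    }

    public static void main(String[] args) {
        double mib = 1024.0 * 1024.0;
        double live = 512 * mib;  // assumed constant live set
        double alloc = 256 * mib; // assumed allocation rate per second
        for (double heapMib : new double[] {540, 1024, 2048, 4096, 10240}) {
            double f = collectionsPerSecond(heapMib * mib, live, alloc);
            System.out.printf("heap=%6.0f MiB  live=%3.0f%% of heap  collections/s=%.3f%n",
                    heapMib, 100.0 * live / (heapMib * mib), f);
        }
    }
}
```

In this model the frequency is inversely proportional to free space rather than to total heap, so the "double the heap, half the collections" rule is closest when the live set is small relative to the heap. The 540 MiB row, where the live set fills about 95% of the heap, shows the thrashing case: the frequency is more than an order of magnitude higher than at 1024 MiB.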

Mechanism and constraints

On the ZGC side, the knobs have partly evolved: JEP 351 lets ZGC return unused heap to the OS (on by default, disabled with -XX:-ZUncommit), and JEP 377 lists -XX:SoftMaxHeapSize (available since JDK 13) as a soft cap. Neither means users can stop setting -Xmx. Ron says the GC team is working to eliminate manual heap sizing so that the collector decides the optimal heap and users only express “prefer RAM or prefer CPU,” with ZGC possibly leading (speaker’s view; as of 2026-06-10 no GA JEP eliminates -Xmx).

Low-level languages can approximate similar efficiency with arenas (large buffer + internal bump, released in one shot at phase end); Ron says Zig supports this well (speaker’s view). JEP 454’s Arena defines temporal bounds—partial overlap with Zig-style allocators, semantics not fully equivalent.
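
The arena pattern described above (one large buffer, an internal bump pointer, everything released in one shot) is small enough to sketch directly. The version below is a plain-Java illustration over a `ByteBuffer`, not the FFM `Arena` API and not Zig's allocator interface; the class `BumpArena` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;

public class BumpArena {
    private final ByteBuffer buffer;
    private int top = 0; // next free offset

    public BumpArena(int capacityBytes) {
        this.buffer = ByteBuffer.allocate(capacityBytes);
    }

    /** Reserves size bytes and returns their offset, or -1 if the arena is full. */
    public int allocate(int size) {
        if (size < 0) {
            throw new IllegalArgumentException("negative size");
        }
        if (size > buffer.capacity() - top) {
            return -1;
        }
        int offset = top;
        top += size; // the whole allocation cost is this bump
        return offset;
    }

    /** Releases every allocation at once by rewinding the pointer. */
    public void reset() {
        top = 0;
    }

    public int used() {
        return top;
    }

    public ByteBuffer storage() {
        return buffer;
    }

    public static void main(String[] args) {
        BumpArena arena = new BumpArena(64);
        int a = arena.allocate(16);
        int b = arena.allocate(32);
        arena.storage().putInt(a, 42); // write into the first allocation
        System.out.println("a=" + a + " b=" + b + " used=" + arena.used());
        System.out.println("allocate(32) when only 16 bytes remain -> " + arena.allocate(32));
        arena.reset(); // end of phase: free everything in one step
        System.out.println("after reset used=" + arena.used());
    }
}
```

There is no per-allocation free, which is the point: the caller gives up individual deallocation in exchange for a constant-time allocate and a constant-time bulk release.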


How to
#

Observe allocation and GC frequency, then tune the heap—rather than copying an “industry standard ratio” first:

# JFR at startup; observe allocation and GC (JDK 11+)
java -XX:StartFlightRecording=filename=recording.jfr,duration=120s \
  -Xmx2g -jar app.jar

# Dump while running
jcmd <pid> JFR.dump filename=recording.jfr

Try ZGC soft cap and uncommit (requires JDK 15+; -Xms should not equal -Xmx or ZUncommit is effectively disabled—see JEP 351):

java -XX:+UseZGC -Xmx8g -XX:SoftMaxHeapSize=4g \
  -XX:+ZUncommit -XX:ZUncommitDelay=300 -jar app.jar
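
To check whether committed heap actually shrinks under these flags, the process can sample its own heap usage. This is a minimal sketch using the standard `MemoryMXBean`; the class name `HeapUsageProbe`, the sample count, and the interval are arbitrary choices for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapUsageProbe {
    /** One snapshot of heap usage as reported by the JVM. */
    public static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    public static void main(String[] args) throws InterruptedException {
        long mib = 1024 * 1024;
        for (int i = 0; i < 3; i++) {
            MemoryUsage u = heap();
            // getMax() returns -1 when the maximum is undefined.
            System.out.printf("used=%d MiB committed=%d MiB max=%d MiB%n",
                    u.getUsed() / mib, u.getCommitted() / mib, u.getMax() / mib);
            Thread.sleep(500);
        }
    }
}
```

The value to watch is `committed`: with uncommit working and an idle application, it should fall over time while `max` stays at the -Xmx value.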

Common misconceptions

  • Assuming SoftMaxHeapSize already removes manual heap configuration—ergonomics still center on -Xms/-Xmx.
  • “I can beat GC with my own free logic”—usually means reimplementing a memory manager; expert-grade malloc investment is enormous (Ron cites Stefan Johansson and others, speaker’s view).
  • Equating FFM Arena with Zig arenas—JEP 454 governs segment lifetime boundaries.

Discussion under JAVAONE'26 branding on the inverse relationship between doubling heap size and GC frequency.


Cross-language comparison: fragmentation, allocators, and no free lunch

Why

“C/C++ uses less memory” often comes from micro-benchmarks, not long-running production shapes. Ron notes that C/C++ programs face fragmentation and long-term degradation too: heap behavior after an hour can differ radically from the start (speaker’s view; the Go GC Guide gives the official description of the non-moving path). C/C++ can switch malloc implementations with effects comparable to switching GC; modern allocators are as complex as GC (speaker’s view).

The Go GC Guide states Go uses a mark-sweep, non-moving GC and pre-allocates address space to limit fragmentation. An analogy to the removed HotSpot CMS helps intuition, but Go’s concurrent mark-sweep and CMS are not equivalent algorithms (speaker’s view: the Go-style path is “similar” to CMS). The non-moving path saves RAM but cannot trade RAM for CPU the way moving collectors can, which matches the argument of the moving-collector section above; quantitative extrapolation remains opinion.

Mechanism and constraints

At the micro level, cache hit vs miss on the same address can mean roughly 100× latency difference, depending on other threads’ behavior (speaker’s view)—isolated micro-benchmarks rarely extrapolate to whole programs. Experienced C++ programmers may avoid some fragmentation, but complexity spreads across compiler, CPU microarchitecture, and cache layers (speaker’s view).

How to
#

Cross-language evaluation should fix workload and observation window, not a single malloc loop:

# Record RSS and CPU time when comparing (example; -l is the macOS/BSD flag, GNU time on Linux uses -v)
/usr/bin/time -l ./native_app
/usr/bin/time -l java -Xmx2g -jar app.jar

At JNI/FFM boundaries, use JEP 454 Arena to define off-heap lifetimes and avoid mixing semantics with JVM GC.
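
A minimal sketch of that lifetime boundary, assuming JDK 22 or later where the FFM API from JEP 454 is final. The class name `OffHeapScope` and the workload are made up; only `Arena`, `MemorySegment`, and `ValueLayout` come from `java.lang.foreign`.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class OffHeapScope {
    /** Sums the values 1..n stored in off-heap memory that lives only inside this call. */
    public static long sumOffHeap(int n) {
        try (Arena arena = Arena.ofConfined()) {
            // byte size and alignment for n ints
            MemorySegment ints = arena.allocate((long) n * Integer.BYTES, Integer.BYTES);
            for (int i = 0; i < n; i++) {
                ints.setAtIndex(ValueLayout.JAVA_INT, i, i + 1);
            }
            long sum = 0;
            for (int i = 0; i < n; i++) {
                sum += ints.getAtIndex(ValueLayout.JAVA_INT, i);
            }
            return sum;
        } // closing the arena releases the segment deterministically
    }

    public static void main(String[] args) {
        System.out.println(sumOffHeap(100));

        // The segment's lifetime is tied to the arena, not to GC reachability.
        Arena arena = Arena.ofConfined();
        MemorySegment segment = arena.allocate(8);
        arena.close();
        System.out.println("alive after close: " + segment.scope().isAlive());
    }
}
```

The segment object itself is still an ordinary heap object managed by the GC; what the arena controls is when the off-heap memory behind it is released and when access to it becomes invalid.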

Common misconceptions

  • Extrapolating from equal-size-object malloc/free loops to Java—different object sizes and lifetime distributions invalidate the leap.
  • Equating Go non-moving with CMS—the former is still active and implemented differently.
  • Ignoring that allocators are long-lived state machines too—switching malloc implementation is not “zero-cost engine swap.”

Studio discussion analogizing switching allocators to switching GC.

JAVAONE'26 backdrop segment: cross-language discussion of fragmentation and compaction.


Benchmarks lie: profiling first

Why

The performance community often says benchmarks lie (speaker’s view; the JMH README emphasizes pitfalls but does not use that exact phrase)—the problem is often exponential extrapolation fallacy, not fabricated table numbers. Micro-benchmarks that repeatedly malloc/free equal-size objects show C++/Rust matching Java speed with less RAM; on real programs (different sizes, lifetimes) that does not extrapolate (speaker’s view).

Comparative third-party benchmarks (“A faster than B”) are almost useless for programs you did not write; what helps is benchmarking your own code while understanding mechanisms and fast paths. Micro-benchmark relevance has declined overall—compilers can generate very different code for isolated snippets vs whole-program context (speaker’s view).

Mechanism and constraints

Recommended workflow (speaker’s view):

  1. Profile the whole program first (JFR, Java Mission Control); find hot paths.
  2. On hot-path implementations, use JMH to compare candidates.
  3. Validate on the whole program again. Otherwise you run into Amdahl’s law: optimizing a path that is 1% of time yields at most ~1% overall gain.
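
The Amdahl bound in step 3 is easy to compute for a profiled hot path. The sketch below uses the standard formula, overall speedup = 1 / ((1 - p) + p / s), where p is the fraction of runtime spent in the optimized path and s is how much faster that path became. The class name `Amdahl` and the sample values are illustrative.

```java
public class Amdahl {
    /** Overall speedup when a fraction p of runtime is made s times faster. */
    public static double speedup(double p, double s) {
        if (p < 0 || p > 1 || s <= 0) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and s > 0");
        }
        return 1.0 / ((1.0 - p) + p / s);
    }

    public static void main(String[] args) {
        // A path that is 1% of runtime, made 10x faster and then almost free.
        System.out.printf("p=0.01, s=10  -> %.4fx%n", speedup(0.01, 10));
        System.out.printf("p=0.01, s=1e9 -> %.4fx%n", speedup(0.01, 1e9));
        // A path that is 60% of runtime, made 2x faster.
        System.out.printf("p=0.60, s=2   -> %.4fx%n", speedup(0.60, 2));
    }
}
```

Even an effectively infinite speedup of the 1% path cannot exceed 1 / 0.99, about 1.01x overall, while a modest 2x on a 60% path gives roughly 1.43x. That difference is why the profile comes before the micro-benchmark.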

JMH reduces pitfalls like dead-code elimination, but a professional veneer can make @Benchmark + tables feel like the finish line; Aleksey Shipilev stresses that numbers are a starting point and you need to understand causes (Nicolai paraphrasing, not Shipilev speaking directly). Micro-benchmark blog posts about others’ APIs (e.g. Stream) can be harmful (Ron’s leaning; speaker’s view).

What usually slows Java applications is not stream vs loop—it is wrong algorithms/data structures, excessive synchronization, I/O (databases, etc.)—most information systems are not compute-bound (speaker’s view).

How to
#

# Start recording (see java(1) manual)
java -XX:StartFlightRecording=filename=recording.jfr,duration=60s -jar app.jar

# Dump while running (see jcmd(1) manual)
jcmd <pid> JFR.dump filename=recording.jfr

# Text view of hotspots
jfr print --events jdk.ExecutionSample recording.jfr
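
A recording can also be read programmatically with the `jdk.jfr` module (JDK 11+), which is handy for summarizing a file before opening it in Java Mission Control. This is a sketch: the class `JfrEventCounts` and its helper `recordBriefly` are made-up names, and which events a short recording contains depends on the JVM and its settings, so the output is not predictable.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class JfrEventCounts {
    /** Counts events per event type name in a JFR file. */
    public static Map<String, Long> countByType(Path file) {
        Map<String, Long> counts = new TreeMap<>();
        try (RecordingFile rf = new RecordingFile(file)) {
            while (rf.hasMoreEvents()) {
                RecordedEvent e = rf.readEvent();
                counts.merge(e.getEventType().getName(), 1L, Long::sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    /** Records this JVM for a moment and returns the dumped file (illustrative helper). */
    public static Path recordBriefly() {
        try (Recording r = new Recording()) {
            r.enable("jdk.GarbageCollection");
            r.start();
            System.gc(); // a hint only; it may or may not produce a GC event
            r.stop();
            Path out = Files.createTempFile("sample", ".jfr");
            r.dump(out);
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass a path such as recording.jfr, or omit it to record this JVM briefly.
        Path file = args.length > 0 ? Path.of(args[0]) : recordBriefly();
        countByType(file).forEach((name, n) -> System.out.println(n + "\t" + name));
    }
}
```

Pointing it at the `recording.jfr` produced by the commands above gives a per-event-type count, which is a quick sanity check that the recording contains execution samples at all.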

Run JMH in a standalone Maven/Gradle project; avoid running the main class directly from the IDE (README says results are less reliable).

Common misconceptions

  • Using JMH scores to prove language superiority—conclusions are invalid when extrapolation boundaries are undefined.
  • Skipping whole-program validation—misidentified hot paths send optimization in the wrong direction.
  • Conflating profiling (your program) with benchmarking (often someone else’s program)—“every program is a snowflake” (speaker’s original imagery).

Background bookshelf with ORACLE branding; segment on whether benchmarks can validate real programs.

JAVAONE'26 backdrop discussing profiling vs benchmarking.

Speaker with hands spread, Java neon logo backdrop, explaining performance methodology.

Close-up: Java logo and plant wall backdrop; discussion of JMH and whole-program validation.


Structured Concurrency roadmap aside

This section is off-topic at the podcast close—not part of the memory theme; API details follow the JEP.

Why

Structured Concurrency binds child-task lifetimes to a parent scope, simplifying cancellation and error propagation. At recording time (JDK 26 had just shipped), Ron said the team originally planned to make it permanent in JDK 27 (speaker’s view).

Mechanism and constraints

As of 2026-06-10, JEP 533 delivers the API as a seventh preview in JDK 27 (the Completed status means that JEP’s delivery is done, not that the API is final); its Summary says the API is previewed once more in JDK 27. That conflicts with the intent Ron stated on the podcast, so trust the JEP. Prior rounds: JEP 480 (JDK 23, third), JEP 499 (JDK 24, fourth), JEP 505 (JDK 25, fifth), JEP 525 (JDK 26, sixth). Under JEP 12, preview APIs are disabled by default.

How to
#

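# Compile and run with preview features enabled
javac --release 27 --enable-preview Main.java
java --enable-preview Main

// Main.java
import java.util.concurrent.StructuredTaskScope;

public class Main {
    // Placeholder workload for illustration; replace with a real call.
    static String fetchData() { return "data"; }

    public static void main(String[] args) throws Exception {
        // No-argument open() as in the JDK 25/26 previews; check the JEP 533 Javadoc for the current signature.
        try (var scope = StructuredTaskScope.open()) {
            var sub = scope.fork(() -> fetchData());
            scope.join(); // waits for the subtask; throws if it failed
            System.out.println(sub.get());
        }
    }
}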

Production use requires tracking JEP status; JDK 28+ may continue preview or eventually finalize (no Final JEP yet).

Common misconceptions

  • Writing “JDK 27 Structured Concurrency GA”—JEP 533 remains preview.
  • Citing JEP 499 as “Third Preview”—the third round is JEP 480 (JDK 23).

Closing segment: JAVAONE'26 and ORACLE bookshelf; the two shift from memory topics to roadmap chat.

