a 40% json parsing speedup in apache pinot, found hiding in plain sight

how pre-production profiling revealed a two-line optimization that had been hiding behind three years of "good" refactors.

when
jan 2026
where
apache pinot · streaming ingestion path
workload
high-throughput kafka → pinot json ingestion (preprod + production)
tooling
async-profiler, flame graphs, jmh
status
merged · apache/pinot#17485
original
linkedin article
tl;dr

JSONMessageDecoder.decode() was parsing every message twice — once into a Jackson JsonNode tree, then again into a Map. A two-line change that binds the bytes directly to a Map via ObjectReader produced 35–55% speedup on small/medium payloads and ~35–40% on large payloads, translating to an estimated 3–5% reduction in overall ingestion CPU across billions of daily events.

it started in preprod

StarTree maintains a pre-production environment that mirrors production ingestion patterns — similar Kafka topics, message rates, payload shapes. I do most of my profiling there before validating findings against production.

While analyzing flame graphs from a high-throughput JSON ingestion pipeline, I noticed an unusual hotspot in RealtimeSegmentDataManager.consumeLoop() — the critical path that processes every message from the stream.

The stack trace pointed to two methods in the JSON decoding path:

At first glance, nothing unusual. But the CPU time told a different story: these two methods together accounted for ~80% of the CPU time spent in JSONMessageDecoder.decode(), and decode() itself was ~30% of total ingestion CPU in preprod.

PreProd CPU flame graph showing JSONMessageDecoder.decode() dominated by bytesToJsonNode and jsonNodeToMap
preprod cpu profile — decode() ≈ 30% of ingestion cpu; bytesToJsonNode + jsonNodeToMap ≈ 80% of decode().

validating in production

PreProd findings are hypotheses until confirmed. I pulled flame graphs from a production cluster running a similar workload. Same pattern, same hotspot — decode() was ~13% of overall ingestion CPU with the same two-method split. This wasn't an artifact of the test environment.

Production CPU flame graph showing the same JSONMessageDecoder.decode() hotspot pattern as preprod
production cpu profile — same hotspot pattern as preprod; decode() ≈ 13% of ingestion cpu, 80% of it in the two parse methods.

the anti-pattern: parsing json twice

Here's what was happening for every JSON message:

// Step 1: Parse bytes into Jackson's tree model
JsonNode message = JsonUtils.bytesToJsonNode(payload, offset, length);

// Step 2: Convert tree model into a Map
Map<String, Object> jsonMap = JsonUtils.jsonNodeToMap(message);

Two parsing phases. Two object-graph constructions. Double the allocations and GC pressure.

why did this survive?

The pattern had evolved across multiple refactors over three years:

  • 2022: centralized ObjectMapper usage (good change, kept the pattern)
  • 2025: eliminated byte-array copies (good change, kept the pattern)

each refactor solved its own problem. none questioned the fundamental approach.

the fix

Jackson's ObjectReader can deserialize directly to a target type. No intermediate tree needed:

public static Map<String, Object> bytesToMap(
        byte[] jsonBytes, int offset, int length) throws IOException {
    return DEFAULT_READER.forType(MAP_TYPE_REFERENCE)
        .readValue(jsonBytes, offset, length);
}

Two lines. That's the entire change.

why not jackson streaming api?

I benchmarked it. Streaming API is ~30% faster for tiny payloads (~100 bytes) but converges to roughly the same performance for larger payloads (>1 KB). The reason is that ObjectMapper's data binding already uses the Streaming API internally — it just builds the Map for us. We're not bypassing streaming, we're avoiding double-construction.

Other reasons for ObjectMapper: no maintenance burden (~100 lines of manual parsing vs 1 line), automatic edge-case handling (BigDecimal, duplicate keys, etc.). Streaming API shines when you're extracting a few fields from large documents — not this workload.

benchmark results

JMH microbenchmark with realistic payload profiles (before / after the fix):

JMH benchmark results table showing 35-55% speedup across payload sizes
jmh benchmark — ~35–55% speedup on small/medium payloads, ~35–40% on large payloads.
payload profilespeedup
small to medium (most common in streaming)35–55%
large (~2 KB)35–40%

Given decode()'s share of ingestion CPU, the expected impact on overall ingestion is 3–5%. That sounds modest until you remember this code runs across billions of daily events — the saved CPU cycles turn into real capacity headroom: more throughput, fresher data, and infrastructure that breathes easier under load.

takeaways

  1. invest in production-like preprod. being able to profile realistic workloads before touching production lets you iterate faster and with confidence.
  2. always validate in production. preprod findings are hypotheses; production confirms them.
  3. question inherited patterns. "we've always done it this way" isn't a reason. code that made sense years ago may not make sense today.
  4. the biggest wins are often the simplest. this wasn't a clever algorithm. it was recognizing we were doing the same work twice.

merged upstream: apache/pinot #17485  ·  full write-up: original linkedin article