uncovering 3 performance optimizations in apache pinot's protobuf ingestion

a customer flame graph at 2–3M events/sec surfaced a rocksdb cache miss, redundant field lookups in the hot path, and a 2.9× faster decoder most users don't know about.

when: jan 2026
where: apache pinot · real-time ingestion path · customer production
workload: protobuf ingestion @ ~2–3M events/sec peak · dedup enabled
tooling: async-profiler, flame graphs, jmh, rocksdb stats
status: finding #2 merged · apache/pinot#17593 · findings #1 & #3 shared as recommendations
original: linkedin article
tl;dr

Three independent findings in a single customer flame graph, combining to an estimated 15–20% overall ingestion CPU reduction:

  • rocksdb data-block cache — 90% hit rate was hiding a 10% miss rate on the dedup critical path · ~7.5–10% CPU
  • protobuf field extraction — findFieldByName() + per-field allocation in a hot loop · up to 4× extractor speedup, ~0.6–2.1% ingestion CPU
  • dynamic vs code-gen decoder — ProtoBufCodeGenMessageDecoder is 2.9× faster than the reflective default · ~6–7% ingestion CPU

it started in production

A customer was running a high-throughput protobuf ingestion pipeline on Pinot, processing roughly 2–3 million events per second at peak. While profiling their production Pinot servers, three independent hotspots stood out in the ingestion path:

  1. rocksdb operations — RocksDB.get / RocksDB.put calls used for dedup metadata
  2. protobuf extraction — ProtoBufRecordExtractor.extract() with repeated findFieldByName() calls
  3. protobuf parsing — DynamicMessage.Builder.mergeFrom() consuming significant CPU
customer production flame graph (~2–3M events/sec) — three hotspots visible: rocksdb get/put, ProtoBufRecordExtractor.extract(), and DynamicMessage.Builder.mergeFrom().

finding #1 — rocksdb data-block cache miss

Digging into the RocksDB metrics for the dedup workload, I saw a split:

metric                       | value | status
overall block cache hit rate | ~95%  | healthy
data block cache hit rate    | ~90%  | lagging

Overall looked fine, but the data block cache specifically was missing 10% of the time. For a dedup workload where every ingested message triggers a get/put, that 10% compounds fast: each miss falls through to the OS page cache or disk, costing orders of magnitude more than an in-memory block cache hit.

the fix

Increase the block cache size to drive the data block hit rate to 95%+. Expected impact: roughly 7.5–10% of overall ingestion CPU recovered.
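To see why a 10% data-block miss rate dominates the average lookup cost, here is a back-of-envelope model. All latency numbers below are illustrative assumptions for this sketch, not measurements from the customer workload:

```java
public class CacheMissCost {
    public static void main(String[] args) {
        // Assumed illustrative latencies: an in-memory block-cache hit vs a
        // miss that falls through to the OS page cache / SSD.
        double hitNanos = 200;       // assumed cost of a block cache hit
        double missNanos = 50_000;   // assumed cost of a miss

        // Expected per-lookup cost at the observed vs target hit rates.
        double costAt90 = 0.90 * hitNanos + 0.10 * missNanos; // observed: 90% data-block hits
        double costAt95 = 0.95 * hitNanos + 0.05 * missNanos; // target after cache tuning

        System.out.printf("avg lookup @90%% hits: %.0f ns%n", costAt90); // 5180 ns
        System.out.printf("avg lookup @95%% hits: %.0f ns%n", costAt95); // 2690 ns
        // Misses dominate the expectation, so halving the miss rate
        // roughly halves the average dedup-lookup cost.
    }
}
```

Under these assumed numbers, the cache hit itself is nearly free; almost the entire average cost is the miss term, which is why the "healthy" overall hit rate was misleading.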

finding #2 — protobuf field extraction: the hidden redundancy

Looking at ProtoBufRecordExtractor.extract(), I noticed this running for every field of every message:

// Called for EVERY message, EVERY field
Descriptors.FieldDescriptor fd = descriptor.findFieldByName(fieldName);

findFieldByName() is a HashMap lookup — fast in isolation. But at 10+ fields × millions of messages/sec, it becomes measurable. Worse, a new ProtoBufFieldInfo object was being allocated for every single field extracted:

// Before: new allocation per field
return new ProtoBufFieldInfo(value, fieldDescriptor);

the fix

  1. cache the field descriptors on the first message — reuse them for every subsequent message
  2. reuse a single ProtoBufFieldInfo instance instead of allocating per field
// After: cached array lookup + reusable object
_reusableFieldInfo.setFieldValue(value);
_reusableFieldInfo.setFieldDescriptor(cachedFieldDescriptors[i]);
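The same pattern can be sketched without any protobuf dependencies. Everything here is a stand-in: FieldInfo plays the role of Pinot's ProtoBufFieldInfo, an int plays the role of a FieldDescriptor, and a Map plays the role of findFieldByName(). The point is the shape of the fix: pay the name lookup once on the first message, then serve every later message from a plain array and a single mutable holder.

```java
import java.util.Map;

public class CachedExtractor {
    // Stand-in for Pinot's ProtoBufFieldInfo: one mutable holder reused
    // across fields instead of a fresh allocation per field.
    static final class FieldInfo {
        Object value;
        int descriptor; // stand-in for Descriptors.FieldDescriptor
    }

    private final String[] _fieldNames;
    private int[] _cachedDescriptors;              // resolved once, on the first message
    private final FieldInfo _reusable = new FieldInfo();

    CachedExtractor(String[] fieldNames) {
        _fieldNames = fieldNames;
    }

    /** Extracts the configured fields; the name lookup runs only for the first message. */
    Object[] extract(Map<String, Integer> schema, Map<Integer, Object> message) {
        if (_cachedDescriptors == null) {          // first message: pay the lookup cost once
            _cachedDescriptors = new int[_fieldNames.length];
            for (int i = 0; i < _fieldNames.length; i++) {
                _cachedDescriptors[i] = schema.get(_fieldNames[i]); // findFieldByName analogue
            }
        }
        Object[] out = new Object[_fieldNames.length];
        for (int i = 0; i < _fieldNames.length; i++) {
            _reusable.value = message.get(_cachedDescriptors[i]);   // cached array lookup
            _reusable.descriptor = _cachedDescriptors[i];
            out[i] = _reusable.value;              // no per-field FieldInfo allocation
        }
        return out;
    }
}
```

After the first message, the per-field work is an array index plus two field writes, with zero allocations on the hot path.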

benchmark — extract optimization (jmh)

extraction pattern                | speedup | est. overall ingestion CPU saved
all fields                        | 1.19×   | ~0.6%
subset (5 fields, common in prod) | 3.5–4×  | ~2.1%
single field                      |         | ~0.9%

The biggest wins come from subset extraction — the common case in production where Pinot tables pull only a handful of fields from each protobuf message.

Status: merged upstream as apache/pinot #17593.

finding #3 — dynamicmessage: the real bottleneck

The biggest single CPU consumer was DynamicMessage.Builder.mergeFrom() — the core protobuf parsing. This isn't a bug; it's inherent to how DynamicMessage works. With .desc files and Pinot's default ProtoBufMessageDecoder, every message is parsed through descriptor-driven reflection into a generic DynamicMessage rather than through generated per-field code.

the alternative: ProtoBufCodeGenMessageDecoder

Pinot has a lesser-known decoder — ProtoBufCodeGenMessageDecoder — that requires compiled protobuf classes (a .jar) but uses runtime code generation via Janino for direct field access.
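Switching is a table-config change on the stream decoder. A sketch of what the relevant streamConfigs snippet might look like — the property names here (protoClassName, jarFile) and the class name com.example.MyEventProto are from memory of the Pinot protobuf input-format plugin and should be verified against the Pinot docs for your version:

```json
"streamConfigs": {
  "stream.kafka.decoder.class.name":
    "org.apache.pinot.plugin.inputformat.protobuf.ProtoBufCodeGenMessageDecoder",
  "stream.kafka.decoder.prop.protoClassName": "com.example.MyEventProto",
  "stream.kafka.decoder.prop.jarFile": "file:///path/to/compiled-protos.jar"
}
```

The operational cost is that the jar of compiled protobuf classes must be built and redeployed whenever the schema changes, which is the trade-off captured in the table below.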

benchmark — dynamic vs code-gen (jmh)

decoder                             | relative throughput
ProtoBufMessageDecoder (reflective) | 1.00×
ProtoBufCodeGenMessageDecoder       | 2.8–2.9×

For this customer's workload (protobuf decode ≈ 10% of ingestion CPU), switching could reduce overall ingestion CPU by 6–7%. Shared as a recommendation with the customer.

which decoder should you use?

situation                                     | recommendation
only have .desc, want to avoid jar ops burden | ProtoBufMessageDecoder (with caching fix)
can compile protos to jar                     | ProtoBufCodeGenMessageDecoder
schema changes frequently                     | ProtoBufMessageDecoder (more flexible)
maximum throughput                            | ProtoBufCodeGenMessageDecoder

putting it all together

Stacking the three findings — cache tuning, extractor caching, and decoder swap — puts ~15–20% of overall ingestion CPU on the table, depending on the customer's exact workload characteristics.

takeaways

  1. profile in production. test environments rarely show the full picture; real customer workloads reveal patterns synthetic tests can't produce.
  2. question the defaults. the default decoder was flexible and easy, but not fast. evaluate alternatives when performance matters.
  3. small inefficiencies compound. a HashMap lookup in nanoseconds becomes significant at millions of ops per second.
  4. look beyond the obvious. the rocksdb cache miss and the extractor overhead were hiding in plain sight — overshadowed by the more visible parsing cost.

merged upstream (finding #2): apache/pinot #17593  ·  full write-up: original linkedin article