Data in your Apache Iceberg tables often doesn't land in a way that's optimized for queries. Continuous micro-batch writes create tens of thousands of small files, and even the fastest background compaction jobs can't guarantee the data is query-ready at the moment users care most: right after ingestion. As a result, queries slow down, resources spike, and costs steadily climb.
Why Query Optimization Alone Can't Solve the Performance Problem
Query optimizations can prune, cache, and vectorize all day—but they can't undo a storm of tiny files. In practice, performance becomes inconsistent when:
- Small files multiply during distributed, concurrent, multi-partition writes.
- Data lands unsorted, blunting pruning and I/O coalescing.
Traditional “fix-it-later” compaction helps, but it's heavy, asynchronous, and often misses the moment users care about most—right after ingest.
Bring Data Warehouse Discipline to the Lakehouse
Traditional data warehouses rarely struggle with small files or erratic performance because data lands optimized — merged and often sorted before it ever hits storage — while lightweight services quietly keep things healthy over time.
We apply the same principle to Apache Iceberg, built around two layers of optimization:
- Ingestion-first: Data is intelligently routed to avoid writer overlap, buffered and merged before flush, and written as large, well-structured files. That means data is query-ready the moment it lands, not hours later after heavy maintenance jobs.
- Compaction service: A steady background service that continuously folds small files into query-sized files and keeps partitions balanced. It is throttled, skips hot partitions, and is always available for instant fixes.
Together, they make Iceberg tables behave more like warehouse tables:
- Ingestion stays fast and stable under heavy load.
- New data is immediately query-ready.
- Continuous compaction keeps data optimized for the best query performance.
How StarRocks Brings This Approach to Life
StarRocks is a high-performance SQL engine built for low-latency, high-concurrency queries on open formats like Apache Iceberg. It powers everything from real-time dashboards to large-scale customer-facing analytics, where query performance directly shapes the user experience.
What's New in StarRocks 4.0
With version 4.0, StarRocks goes beyond query execution. It pushes performance into the data layer, optimizing how data is written, organized, and maintained.
Global Shuffle Ingestion
A new global shuffle mechanism intelligently routes data to avoid overlapping writes across backends. Each node handles a distinct subset of partitions, producing fewer, larger files instead of thousands of tiny ones. This significantly reduces metadata overhead and improves query scan efficiency — especially in high-partition workloads.
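As a quick illustration, consider the kind of high-partition ingest that benefits from this. The catalog, schema, and column names below are illustrative, and catalog property names vary with your catalog type; the global shuffle itself is applied automatically, with no extra syntax.

```sql
-- Illustrative setup: an Iceberg REST catalog (property names depend on
-- your catalog type) and a table identity-partitioned by date.
CREATE EXTERNAL CATALOG iceberg_lake
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.uri" = "http://iceberg-rest:8181"
);

CREATE TABLE iceberg_lake.analytics.events (
    event_id BIGINT,
    user_id BIGINT,
    event_time DATETIME,
    dt DATE
)
PARTITION BY (dt);

-- With 4.0's global shuffle, rows are routed so that each backend writes a
-- distinct subset of partitions, yielding fewer, larger files per partition.
INSERT INTO iceberg_lake.analytics.events
SELECT event_id, user_id, event_time, dt
FROM staging.raw_events;
```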
Spill-Aware Writes
Instead of forcing premature flushes when memory pressure builds, StarRocks now buffers as much as possible in memory and automatically spills to disk or object storage when needed. This avoids flushing undersized data files early just to dodge OOM errors, keeping files close to the target size and ingestion stable even with thousands of partitions.
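A minimal sketch of what this looks like in practice, assuming the write path honors StarRocks' existing spill session variable; the exact knobs for the 4.0 Iceberg sink may differ, so check the release notes.

```sql
-- Assumption: spill for writes is toggled like query spill; verify the
-- variable names against the 4.0 documentation.
SET enable_spill = true;

-- A wide insert across many partitions: buffers stay in memory as long as
-- possible and spill to disk or object storage rather than forcing early
-- flushes of undersized files.
INSERT INTO iceberg_lake.analytics.events
SELECT event_id, user_id, event_time, dt
FROM staging.raw_events;
```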
Compaction API
When maintenance is needed, for example after many micro-batches, StarRocks 4.0 introduces a new Compaction API. It uses the same shuffle, spill, and sort logic as ingestion to quickly merge files on demand. Instead of relying on external tooling, you can use StarRocks to fix your data layout issues wherever needed.
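As a sketch, an on-demand run might look like the statement below. The syntax is modeled on StarRocks' existing manual compaction for native tables and is a placeholder here; the actual Iceberg Compaction API surface in 4.0 may differ.

```sql
-- Placeholder syntax modeled on native-table manual compaction
-- (ALTER TABLE ... COMPACT); confirm the Iceberg form in the 4.0 docs.
ALTER TABLE iceberg_lake.analytics.events COMPACT;
```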
Local Sort for Query-Friendly Layouts
At the file level, StarRocks can now write sorted data during ingestion. Each file arrives internally ordered and pruning-friendly, which improves query latency without requiring a separate sorting job downstream.
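A hedged sketch, assuming the sort key is declared at table creation the way native StarRocks tables declare one with ORDER BY; the exact DDL for Iceberg sorted writes may differ.

```sql
-- Assumption: ORDER BY at creation declares the file-level sort, as it does
-- for native tables; verify the Iceberg DDL in the 4.0 docs.
CREATE TABLE iceberg_lake.analytics.events_sorted (
    event_id BIGINT,
    user_id BIGINT,
    event_time DATETIME,
    dt DATE
)
PARTITION BY (dt)
ORDER BY (user_id, event_time);

-- Files written by this insert arrive internally sorted by
-- (user_id, event_time), so min/max statistics prune effectively.
INSERT INTO iceberg_lake.analytics.events_sorted
SELECT event_id, user_id, event_time, dt
FROM staging.raw_events;
```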
The result: once data is written, it's not just persisted—it's immediately optimized for fast, predictable queries.
Benchmarks
We conducted a series of ingestion tests to compare StarRocks 4.0 with version 3.4 on Apache Iceberg tables. The tests measured write latency and resulting file counts across workloads with 100, 500, and 1000 partitions. The results are shown below:
| Number of partitions | 100 | 500 | 1000 |
| --- | --- | --- | --- |
| Latency (StarRocks 3.4) | OOM (> 33 min) | OOM (> 46 min) | OOM |
| Latency (StarRocks 4.0) | 5 min 30 s | 8 min 9 s | 10 min 15 s |
| File count (StarRocks 3.4) | > 170,000 | > 230,000 | > 350,000 |
| File count (StarRocks 4.0) | 259 | 596 | 2,000 |
- Improved stability under scale: The previous version frequently hit OOM errors when writing to 100 or more partitions. Version 4.0 ingests data into up to 1000 partitions without failures.
- Substantially reduced latency: Where version 3.4 ran for over 33 minutes (100 partitions) and over 46 minutes (500 partitions) before failing, version 4.0 completes in 5 min 30 s and 8 min 9 s respectively, significantly shortening end-to-end data freshness.
- Fewer, larger files: The new ingestion path produces orders of magnitude fewer files. File count dropped from over 170,000 to 259 with 100 partitions, and from over 230,000 to 596 with 500 partitions.
Conclusion: A Query-Ready Lakehouse
Iceberg provides the openness and governance the modern data lakehouse demands. But performance requires more than an open format; it requires the discipline of a warehouse in managing the files beneath the surface.
With StarRocks 4.0, that discipline comes built in. Data lands query-ready, ingestion stays stable, and compaction becomes a quick, on-demand operation. The result is a lakehouse that feels as fast and predictable as a warehouse—without sacrificing openness.
If you’d like to learn more, join the Release Webinar to see a deep dive into StarRocks 4.0 and explore how these capabilities can power your low-latency, high-concurrency workloads. [Register here]
