Want the story straight from the source? In this StarRocks Summit 2025 session, Connor Clark, Staff Software Engineer at Demandbase, breaks down how his team runs Iceberg and StarRocks together as a unified analytics platform. From nightly INSERT OVERWRITE batch loads driven by StarRocks' SUBMIT TASK to Spark-driven CDC streams and a dual-loading strategy across Iceberg and StarRocks, he shows exactly how they keep customer-facing analytics both fast and reliable. Watch the recording below.
Petabyte-Scale Data | Blazing Fast Ingestion | Open and Resilient
About Demandbase
Demandbase is the leading account-based go-to-market platform for B2B enterprises. The company helps businesses identify and target the right customers at the right time with the right message through unified intent data, AI-powered insights, and prescriptive actions.
Thousands of companies rely on Demandbase to maximize revenue and consolidate their data and tech stacks into one platform. The platform combines and processes marketing data from numerous sources, handling both bulk imports and streaming events. Core to the product is processing data at scale and delivering flexible, fast reporting and insights. To support this, Demandbase needed a high-performance data infrastructure that could handle growing and complex workloads while maintaining near real-time responsiveness.
Challenges
Demandbase faced a critical architectural decision as their data volumes scaled into the petabyte range. The company needed a solution that could handle massive data growth while supporting both batch and streaming workloads without compromising query performance.
The platform required several capabilities that traditional data warehouses struggled to deliver together:
- Change data capture from PostgreSQL needed to flow seamlessly into both a data lake and a query engine for immediate customer access. Enrichment processes running nightly had to transform raw lake data into queryable warehouse tables without creating operational bottlenecks.
- Cost became a major concern as data volumes grew. Storage expenses threatened to spiral as the platform stored more customer and third-party data. The team needed cost-effective, scalable storage that wouldn't break the budget at petabyte scale.
- Disaster recovery and upgrade strategies also presented challenges. Losing a production cluster couldn't mean hours of downtime while teams scrambled to restore data. Similarly, upgrading to new database versions needed a seamless strategy that allowed testing and validation before cutting over production traffic.
The company evaluated multiple architectures but struggled to find a solution that could handle their unique requirements. They needed to support concurrent writes from multiple systems, separate storage from compute for cost efficiency and flexibility, and maintain strong separation between batch processing and real-time analytics. Traditional data warehouses either couldn't scale cost-effectively or required complex workarounds to support their streaming and batch workflows together.
Solution
After evaluating their architecture needs, Demandbase built a data lakehouse combining Apache Iceberg as the storage layer with StarRocks as the analytics engine.

Apache Iceberg provided the storage foundation:
- Open, cost-effective, scalable storage backed by cloud providers
- Strong separation of storage and compute
- Time travel and rollbacks
- Comprehensive schema evolution, including partitioning
StarRocks served as the low-latency query engine:
- Fast, real-time analytics for a responsive user experience
- High-performance join queries to meet SLAs and reduce the need for denormalization
- Excellent stream loading with upsert handling
- Query engine integration with Iceberg via external catalogs (sketched below)
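As a rough sketch of that last point, exposing an Iceberg catalog to StarRocks is a single DDL statement. The catalog name, metastore type, and endpoint below are illustrative assumptions, not Demandbase's actual configuration:

```sql
-- Illustrative only: register an Iceberg catalog so StarRocks can
-- query its tables in place, without copying data.
CREATE EXTERNAL CATALOG iceberg_lake
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "hive",  -- or glue/rest, depending on deployment
    "hive.metastore.uris" = "thrift://metastore-host:9083"  -- hypothetical endpoint
);

-- Iceberg tables then become addressable with three-part names, e.g.:
-- SELECT * FROM iceberg_lake.lake_db.accounts LIMIT 10;
```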
The architecture connects the two through batch and streaming patterns.
For batch operations, StarRocks' INSERT OVERWRITE accepts SELECT statements, which lets the Demandbase team query Iceberg directly and refresh native-table partitions on demand. StarRocks' built-in task queue manages these operations with configurable parallelism.
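A minimal sketch of that batch pattern, with hypothetical database, table, and partition names (the real schema isn't described in the talk):

```sql
-- Asynchronously rebuild one partition of a native StarRocks table
-- from its Iceberg counterpart; the FE task queue runs it in the background.
SUBMIT TASK refresh_accounts_20250101 AS
INSERT OVERWRITE analytics.accounts PARTITION (p20250101)
SELECT * FROM iceberg_lake.lake_db.accounts
WHERE dt = '2025-01-01';
```

Because the overwrite is scoped to a partition, only that slice of the table is replaced, which is what makes on-demand, partition-level refreshes from Iceberg cheap.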
On the streaming side, Spark jobs read from and write to Kafka topics that StarRocks then consumes. With StarRocks' routine loads, users can pull from Kafka topics and update designated tables on the fly, while StarRocks maintains consumption state and supports full-table, partial, and conditional update patterns. The data can then be inserted into primary key tables to enable create, update, and delete operations.
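A hedged sketch of what such a routine load can look like; the topic, brokers, columns, and table names are hypothetical. The `__op` field is StarRocks' convention for marking a row in a primary key table as an upsert (0) or a delete (1):

```sql
-- Continuously consume CDC events from Kafka into a primary key table.
CREATE ROUTINE LOAD analytics.accounts_cdc_load ON accounts
COLUMNS (id, name, created_at, __op)  -- __op: 0 = upsert, 1 = delete
PROPERTIES (
    "format" = "json",
    "jsonpaths" = "[\"$.id\",\"$.name\",\"$.created_at\",\"$.op\"]",
    "partial_update" = "true"  -- only the listed columns are written
)
FROM KAFKA (
    "kafka_broker_list" = "broker-1:9092,broker-2:9092",
    "kafka_topic" = "accounts_cdc",
    "property.group.id" = "starrocks_blue"  -- a distinct group per cluster supports blue-green
);
```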
A critical use case for Demandbase was surfacing net-new records in their data pipeline. Using partial routine loads, StarRocks streams create and delete changes for the columns populated at creation time, giving customers immediate visibility. After the nightly Iceberg enrichment completes, StarRocks loads the fully enriched rows via INSERT INTO, which acts as an upsert for existing records.
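The nightly backfill step then reduces to a plain INSERT INTO against the primary key table, since writes to matching keys overwrite the earlier partial rows; names are again hypothetical:

```sql
-- Replay fully enriched rows from Iceberg: existing primary keys are
-- updated in place, new keys are inserted.
INSERT INTO analytics.accounts
SELECT * FROM iceberg_lake.lake_db.accounts_enriched
WHERE dt = '2025-01-01';
```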
Architecturally, Iceberg can now act as the source of truth, enabling enterprise disaster recovery strategies: teams can stand up new StarRocks clusters with region failover and load them directly from Iceberg. For upgrades, they employ blue-green deployment, standing up a second cluster on the newer version, loading it from Iceberg, mirroring production traffic, then starting routine loads with separate consumer groups. Once the new cluster is validated, traffic migrates over via Route 53.
And to top it off, Demandbase built a lightweight load service on top of StarRocks to streamline user requests and resource delegation. Internal teams submit requests specifying tables and partitions, and the service handles task submission, completion tracking, and retries using StarRocks' task management features. A similar service manages routine loads, automatically handling schema changes by stopping loads, validating upstream data, and creating updated routine loads.
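A sketch of the kind of tracking such a service can do with StarRocks' built-in task metadata; the task name is the hypothetical one from the batch example above:

```sql
-- Poll the latest run of an async task; a service can retry on FAILED state.
SELECT task_name, state, error_message
FROM information_schema.task_runs
WHERE task_name = 'refresh_accounts_20250101'
ORDER BY create_time DESC
LIMIT 1;
```

The companion routine-load service can drive the same lifecycle with SHOW ROUTINE LOAD, STOP ROUTINE LOAD, and a fresh CREATE ROUTINE LOAD once the upstream schema change is validated.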
Results
The StarRocks and Iceberg architecture delivered significant operational and cost benefits:
- Petabyte-scale data storage with cost-effective lakehouse backing
- Near real-time data freshness through partial routine loads
- Robust disaster recovery with Iceberg as the source of truth
- Seamless upgrades via blue-green deployment loaded from Iceberg
- Simplified operations through load service abstraction
- Strong performance on normalized schemas, avoiding expensive denormalization
- Lightning-fast analytics backing customer-facing reporting
The architecture proved particularly valuable for concurrent access patterns. Multiple systems write to Iceberg simultaneously while StarRocks queries the same data for analytics. Partitioned workloads during enrichment avoid conflicts, and teams report minimal issues with concurrent writes to the same Iceberg tables.
Performance comparisons during proof-of-concept testing showed significantly better query latency from StarRocks native tables compared to querying Iceberg directly via external catalogs. This justified the architecture of loading frequently accessed data into StarRocks while keeping rarely used tables as external Iceberg references.
What's Next for Demandbase
Demandbase continues expanding their StarRocks and Iceberg integration:
- Adding conditional stream loading to support predicate-based updates in routine loads
- Investing in priority retry handling so failed load tasks aren't delayed behind newly submitted ones
- Scaling the use of federated queries across environments
