
For years, denormalization has been the go-to workaround when query engines couldn’t handle complex joins at scale. Flatten your data, run a nightly pipeline, and pray it doesn’t break. Sound familiar? This approach works — until it doesn’t. As customer-facing analytics becomes more interactive and real-time, the cracks start to show. In this article, we’ll look at why teams denormalize, what it costs, and what options you have for avoiding those costs.
Why Teams Denormalize in the First Place
It’s not because they want to — it’s because they have to. Traditional analytical engines struggle with:
- Distributed JOINs across multiple large tables
- High-cardinality aggregations
- Multi-tenant workloads with unpredictable query shapes
To make dashboards responsive, engineers precompute as much as possible.
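To make that concrete, here is a minimal sketch of the usual pattern, assuming hypothetical orders, customers, and products tables (syntax such as INSERT OVERWRITE varies by engine): a scheduled job joins the fact table with its dimensions once and writes a wide orders_flat table for dashboards to scan.

```sql
-- Hypothetical nightly flattening job: join the fact table with its
-- dimensions once and store the wide result for dashboards to scan.
INSERT OVERWRITE orders_flat
SELECT
    o.order_id,
    o.order_ts,
    o.amount,
    c.customer_name,   -- repeated for every order the customer places
    c.region,
    p.product_name,    -- repeated for every order containing the product
    p.category
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products  p ON o.product_id  = p.product_id;
```

Dashboards then read only orders_flat, so no join runs at query time; every trade-off in the next section comes from maintaining that copy.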
But Denormalization Comes at a Cost
Flattening data may save the cost of running joins on the fly, but it introduces deep architectural trade-offs that become painful at scale.
- Storage overhead: When fact and dimension data are merged, every shared dimension value is duplicated across billions of fact rows. As datasets grow, storage usage typically climbs to 10 times or more that of a normalized model.
- Pipeline complexity: Denormalization requires orchestration-heavy pipelines that introduce additional failure points, slow down development, and create tight coupling between analytics logic and data engineering workflows. Over time, these pipelines become the primary source of maintenance overhead.
- Stale insights: Keeping denormalized data fresh is especially hard. Supporting low-latency updates typically means rebuilding the entire pipeline on a streaming framework, which involves rewriting joins and handling state. These systems are complex to operate and debug at scale, so many teams revert to batch workflows and accept delayed data as the price of reliability.
- Schema rigidity: Want to add a new dimension? That means updating the join logic, reshaping the denormalized output, and backfilling historical partitions (sketched below). The larger the dataset, the more time-consuming and risky the operation.
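To illustrate that rigidity, the sketch below continues the hypothetical orders_flat example: exposing a single new sales_channel dimension (from an assumed channels table) means altering the wide table, rewriting the flattening job, and re-running it over history.

```sql
-- Hypothetical change: expose a new sales_channel dimension.

-- 1. Reshape the flattened table.
ALTER TABLE orders_flat ADD COLUMN sales_channel VARCHAR(64);

-- 2. Rewrite the nightly flattening job to join the extra dimension,
--    then re-run it over historical data. On billions of rows this
--    backfill is slow, expensive, and risky.
INSERT OVERWRITE orders_flat
SELECT
    o.order_id, o.order_ts, o.amount,
    c.customer_name, c.region,
    p.product_name, p.category,
    ch.sales_channel               -- one new column forces a full rebuild
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products  p ON o.product_id  = p.product_id
JOIN channels  ch ON o.channel_id = ch.channel_id;
```

In practice the backfill runs partition by partition against a live table, which is exactly where nightly pipelines tend to break.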
In internal BI workflows, these compromises might be tolerable. But when analytics is embedded in the product — visible to customers, partners, or sellers — any delay, mismatch, or failure becomes a user experience issue.
What It Takes to Make On-the-Fly JOINs Work
Skipping denormalization sounds appealing — but it only works if the engine can handle joins and aggregations reliably at scale. That requires more than just SQL syntax.
Architectural capabilities matter:
- A cost-based optimizer that can choose efficient join strategies based on up-to-date table statistics, not static rules.
- A distributed execution model (MPP) that pushes joins and aggregations across the cluster, allowing joins to scale with the size of your data (see the sketch after this list).
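The sketch below shows the query side, reusing the hypothetical normalized tables from earlier and MySQL-style date functions (statistics collection and EXPLAIN output differ between engines): the dashboard query joins the normalized tables at request time, and EXPLAIN reveals the join strategy the cost-based optimizer picked, for example broadcasting the small dimension tables while the large fact table stays distributed.

```sql
-- Query the normalized tables directly at request time; no flattened copy.
EXPLAIN
SELECT
    c.region,
    p.category,
    SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products  p ON o.product_id  = p.product_id
WHERE o.order_ts >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY c.region, p.category;
```

With an MPP engine, each node joins and pre-aggregates its own slice of the fact table, so the same dashboards that once scanned the flattened copy can hit the normalized model directly, always against fresh data.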
With the right foundation, normalized workloads can scale in production. Demandbase shows what that looks like in practice.
Real-World Example: Demandbase
Demandbase — a leading B2B go-to-market platform — initially relied on ClickHouse, where limited JOIN performance forced them to denormalize everything. The results:
- A significant waste of storage, more than 10 times the required space.
- Fresh, real-time data was impractical due to the complexity of the denormalization pipelines.
- Added overhead, demanding more computing resources and infrastructure.
After switching to CelerData (powered by StarRocks), they enabled runtime JOINs over normalized data, eliminating denormalization in most cases. The impact:
- 60% reduction in infrastructure requirements
- 90% lower storage costs
- A greatly simplified data pipeline, eliminating the need for heavy ETL
Want to Learn How?
Our latest white paper walks through the system design behind scalable customer-facing analytics, including:
- How to make joins fast and scalable
- How to ensure consistent performance under load
- Two reference architectures: one for real-time customer-facing analytics with low-latency ingestion, and one for lakehouse-native workloads prioritizing governance and open table formats.
Learn how to ditch denormalization. Read the full white paper here.