
For years, denormalization has been the go-to workaround when query engines couldn’t handle complex joins at scale. Flatten your data, run a nightly pipeline, and pray it doesn’t break. Sound familiar? This approach works — until it doesn’t. As customer-facing analytics becomes more interactive and real-time, the cracks start to show. In this article, we’ll look at why teams denormalize, what it costs, and what options you have for avoiding those costs.
Why Teams Denormalize in the First Place
It’s not because they want to — it’s because they have to. Traditional analytical engines struggle with:
- Distributed JOINs across multiple large tables
- High-cardinality aggregations
- Multi-tenant workloads with unpredictable query shapes
To make dashboards responsive, engineers precompute as much as possible.
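To make that concrete, here is a minimal sketch of the usual pattern, assuming hypothetical orders, customers, and products tables (syntax such as INSERT OVERWRITE varies by engine): a scheduled job joins the fact table with its dimensions once and writes a wide orders_flat table for dashboards to scan.

```sql
-- Hypothetical nightly flattening job: join the fact table with its
-- dimensions once and store the wide result for dashboards to scan.
INSERT OVERWRITE orders_flat
SELECT
    o.order_id,
    o.order_ts,
    o.amount,
    c.customer_name,   -- repeated for every order the customer places
    c.region,
    p.product_name,    -- repeated for every order containing the product
    p.category
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products  p ON o.product_id  = p.product_id;
```

Dashboards then read only orders_flat, so no join runs at query time; every trade-off in the next section comes from maintaining that copy.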
But Denormalization Comes at a Cost
Flattening data may save the cost of running joins on the fly, but it introduces deep architectural trade-offs that become painful at scale.
- Storage overhead: When fact and dimension data are merged, every shared dimension value is duplicated across billions of fact rows. As datasets grow, storage usage typically climbs to 10 times or more that of a normalized model.
- Pipeline complexity: Denormalization requires orchestration-heavy pipelines that introduce additional failure points, slow down development, and create tight coupling between analytics logic and data engineering workflows. Over time, these pipelines become the primary source of maintenance overhead.
- Stale insights: Keeping denormalized data fresh is especially hard. Supporting low-latency updates typically means rebuilding the entire pipeline on a streaming framework, which involves rewriting joins and handling state. These systems are complex to operate and debug at scale, so many teams revert to batch workflows and accept delayed data as the price of reliability.
- Schema rigidity: Want to add a new dimension? That means updating the join logic, reshaping the denormalized output, and backfilling historical partitions (sketched below). The larger the dataset, the more time-consuming and risky the operation.
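To illustrate that rigidity, the sketch below continues the hypothetical orders_flat example: exposing a single new sales_channel dimension (from an assumed channels table) means altering the wide table, rewriting the flattening job, and re-running it over history.

```sql
-- Hypothetical change: expose a new sales_channel dimension.

-- 1. Reshape the flattened table.
ALTER TABLE orders_flat ADD COLUMN sales_channel VARCHAR(64);

-- 2. Rewrite the nightly flattening job to join the extra dimension,
--    then re-run it over historical data. On billions of rows this
--    backfill is slow, expensive, and risky.
INSERT OVERWRITE orders_flat
SELECT
    o.order_id, o.order_ts, o.amount,
    c.customer_name, c.region,
    p.product_name, p.category,
    ch.sales_channel               -- one new column forces a full rebuild
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products  p ON o.product_id  = p.product_id
JOIN channels  ch ON o.channel_id = ch.channel_id;
```

In practice the backfill runs partition by partition against a live table, which is exactly where nightly pipelines tend to break.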
In internal BI workflows, these compromises might be tolerable. But when analytics is embedded in the product — visible to customers, partners, or sellers — any delay, mismatch, or failure becomes a user experience issue.
What It Takes to Make On-the-Fly JOINs Work
Skipping denormalization sounds appealing — but it only works if the engine can handle joins and aggregations reliably at scale. That requires more than just SQL syntax.
Architectural capabilities matter:
- A cost-based optimizer that can choose efficient join strategies based on up-to-date table statistics, not static rules.
- A distributed execution model (MPP) that pushes joins and aggregations across the cluster, allowing joins to scale with the size of your data (see the sketch after this list).
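The sketch below shows the query side, reusing the hypothetical normalized tables from earlier and MySQL-style date functions (statistics collection and EXPLAIN output differ between engines): the dashboard query joins the normalized tables at request time, and EXPLAIN reveals the join strategy the cost-based optimizer picked, for example broadcasting the small dimension tables while the large fact table stays distributed.

```sql
-- Query the normalized tables directly at request time; no flattened copy.
EXPLAIN
SELECT
    c.region,
    p.category,
    SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products  p ON o.product_id  = p.product_id
WHERE o.order_ts >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY c.region, p.category;
```

With an MPP engine, each node joins and pre-aggregates its own slice of the fact table, so the same dashboards that once scanned the flattened copy can hit the normalized model directly, always against fresh data.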
With the right foundation, normalized workloads can scale in production. Demandbase shows what that looks like in practice.
Real-World Example: Demandbase
Demandbase — a leading B2B go-to-market platform — initially relied on ClickHouse, where limited JOIN performance forced them to denormalize everything. The results:
- A significant waste of storage, more than 10 times the required space.
- Fresh, real-time data was impractical due to the complexity of the denormalization pipelines.
- Added overhead, demanding more computing resources and infrastructure.
After switching to CelerData (powered by StarRocks), they enabled runtime JOINs over normalized data, eliminating denormalization in most cases. The impact:
- 60% reduction in infrastructure requirements
- 90% lower storage costs
- A greatly simplified data pipeline, eliminating the need for heavy ETL
Want to Learn How?
Our latest white paper walks through the system design behind scalable customer-facing analytics, including:
- How to make joins fast and scalable
- How to ensure consistent performance under load
- Two reference architectures: one for real-time customer-facing analytics with low-latency ingestion, and one for lakehouse-native workloads prioritizing governance and open table formats.
Learn how to ditch denormalization. Read the full white paper here.