Real-Time Analytics: 7 Benefits and a Practical Path to Each
If you’ve ever pushed a fix, launched a promo, or turned on a fraud rule and then waited hours for the dashboard to catch up, you’ve met the enemy: latency between when data happens and when you can safely act on it.
Real-time analytics is a design choice that collapses that gap. It blends fast ingestion, low-latency queries, and the ability to join today’s events with historical context, so you can decide in seconds—not tomorrow morning.
Think of it this way: batch analytics tells you what happened; real-time analytics helps you decide what to do next while the window is still open.
What is real-time analytics?
Real-time analytics is the continuous process of:
- Ingesting fresh data (streams or micro-batches),
- Making it queryable quickly (usually within seconds), and
- Answering complex questions—filters, multi-table joins, aggregations, and windowed calculations—at interactive latency for many concurrent users.
Key properties
- Freshness target: seconds to tens of seconds (sometimes sub-second for point lookups or gating decisions).
- Query shape: not just counts—multi-table joins, ad-hoc filters, group-bys, windowed metrics (time windows, session windows, etc.).
- Users: both humans (dashboards, investigations) and services (features, risk checks, personalization APIs).
- Data mutability: supports upserts/deletes (orders change, statuses flip, profiles update). Append-only fits some cases; many operational analytics require safe mutation.
What “real-time” means here: the end-to-end latency a user or service experiences—from the moment an event occurs to the moment it’s queryable and a result is returned. Analytics systems track two clocks: event time (when it happened) and processing time (when the system observed and made it available). Define your SLOs in terms of processing time / end-to-end latency, because that’s what drives data freshness and user experience.
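The two clocks above can be made concrete with a minimal Python sketch that computes an end-to-end freshness percentile from per-event timestamps. The `EventRecord` name and fields are illustrative, not any product’s API; a real pipeline would pull these timestamps from its ingestion metadata.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class EventRecord:
    event_time: datetime       # event time: when it actually happened
    queryable_time: datetime   # processing time: when the store made it queryable

def freshness_seconds(records, pct):
    """End-to-end freshness at the given percentile (nearest-rank), in seconds."""
    lags = sorted((r.queryable_time - r.event_time).total_seconds() for r in records)
    # nearest-rank percentile: the smallest value covering pct% of observations
    return lags[max(0, math.ceil(pct / 100 * len(lags)) - 1)]

# Example: ten events that became queryable 1..10 seconds after they occurred
base = datetime(2024, 1, 1)
records = [EventRecord(base, base + timedelta(seconds=s)) for s in range(1, 11)]
p95 = freshness_seconds(records, 95)  # worst-case-ish lag a user actually sees
```

An SLO stated this way (“p95 freshness under 10 s, event to queryable”) is what users experience, regardless of how fast any single component claims to be.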
Don’t Confuse Real-Time Analytics With These
Real-time analytics vs. streaming analytics
- Streaming analytics (Flink/Spark/Kafka Streams) = how you compute on flowing data (enrich, window, detect patterns) and emit results.
- Real-time analytics = how you serve fresh data to people and services with low-latency, multi-table queries (today’s events + history, on demand).
- In practice: streaming often feeds the real-time store; modern engines can also ingest streams directly and compute metrics at query time.
Real-time analytics vs. real-time dashboards
- A real-time dashboard is a UI pattern.
- Real-time analytics is the backend capability (correct, fast, joinable data for dashboards, APIs, alerts). A flashy UI with stale data isn’t real-time.
Real-time analytics vs. CEP (complex event processing)
- CEP finds patterns in the stream (“A then B within 30s”).
- Real-time analytics may consume CEP outputs but still needs a store that can join detections with dimensions/history—interactively.
Real-time OLTP vs. real-time OLAP
- OLTP: transactions, point lookups—great for single records.
- Real-time OLAP: columnar + MPP for scans/joins/aggregations at interactive speed. Modern systems add primary-key upserts so mutable facts don’t wreck scan speed.
Latency cheat sheet (pick an SLO per use case)
- Interactive: <1–3s (drill-downs, risk gates, feature APIs)
- Near-real-time: 3–30s (continuous dashboards)
- Micro-batch: 30–300s (overviews; too slow for incidents)
Common gotchas
- “We have Kafka, so we’re real-time.” → Kafka moves events; you still need a query engine for fresh, ad-hoc, multi-table analysis.
- “We use wide tables, so it’s fast.” → Until the schema or questions change. Engines that join on the fly avoid constant backfills.
- “Upserts kill columnar performance.” → Often true—unless your OLAP supports primary-key upserts designed for columnar storage (e.g., delete-and-insert with efficient delete markers + PK indexes), so freshness and speed can coexist.
The top 7 benefits of implementing real-time analytics
1) Make decisions while the window is still open
Real-time analytics turns “we’ll know tomorrow” into “we know now.” Product and marketing teams can compare what just happened with months of history, filter by customer or campaign, and change tactics while spend or demand is unfolding. This only works if the system can join fresh facts with historical context on demand (not just count events on a single table).
In practice, advertiser platforms such as Pinterest have reported roughly 50% lower p90 latency while running on about 32% of their prior instance count and keeping ~10-second freshness after moving to a modern OLAP engine that supports rich SQL (joins, subqueries, materialized views). That combination made mid-campaign optimization a routine habit rather than a post-hoc analysis.
Helpful guardrails: track p90/p95 latency during peak and define a freshness SLO (processing time) that’s measured end-to-end, not just per component.
2) Risk controls that evolve as fast as threats
Fraud and abuse patterns shift in minutes. If balances, device reputation, or status flags are stale, you either miss losses or over-block legitimate users. The key is keeping mutable signals current without slowing analytics. Primary-key upserts designed for columnar engines—often delete-and-insert with efficient delete markers and a PK index—keep writes fast and reads vectorized.
In high-QPS risk-feature services, teams report ~1 s ingestion, sub-second point lookups, and >1,000 QPS on feature queries—exactly the profile needed for online gating.
Helpful guardrails: measure time from event to enforced decision, watch tail latencies under traffic spikes, and compare false-positive/negative rates before vs. after going fresher.
3) Fewer systems, fewer backfills, lower cost
Running a “speed layer” alongside a batch warehouse doubles logic and doubles places things drift. A single analytical surface that serves both fresh and historical queries simplifies operations: fewer pipelines to maintain, fewer reconciliations, and less redundant storage.
A payments platform that consolidated a Kudu/HBase/Hive/Elasticsearch stack achieved p95 latency under one second, ingestion of millions of rows per minute, and millisecond-level aggregations on roughly 30 TB of data through asynchronous materialized views — along with simpler operations and faster recovery.
Helpful guardrails: track infra spend per 1,000 queries, “pipelines per metric” (aim for one), and time-to-ship a new metric or dimension.
4) Concurrency that turns analytics into a product feature
It’s one thing for a single analyst to run a query; it’s another for thousands of users or API calls to explore at once without timeouts. Product-scale real-time analytics relies on a cost-based optimizer, runtime filters, and the ability to switch join strategies (broadcast/shuffle/colocated) under load.
That’s how advertiser tools like Pinterest keep join-heavy queries interactive under load, instead of forcing everything into brittle wide tables. You also see this in Celonis, a process mining company: their dashboards operate over tens of billions of rows, leaning on join reordering, colocated joins, and partial-column updates to keep pages usable; they report P90 ≈ 20 seconds for complex dashboards while maintaining fresh data. In Web3 investigations, similar techniques keep multi-hop (graph-like) traces responsive by shrinking what each join has to scan at every step.
Helpful guardrails: Validate concurrency using your peak patterns (filters + joins, not just scans), watch tail latency and timeout rates, and be ready to switch among broadcast, shuffle, and colocated joins as data shape shifts.
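To make the broadcast-versus-shuffle distinction concrete, here is a deliberately simplified Python sketch of how a planner might pick a strategy. The row-count threshold and partition count are toy assumptions; a real cost-based optimizer weighs statistics, memory, and network cost, not a single constant.

```python
BROADCAST_LIMIT = 1000  # toy threshold in rows; real planners use size/stat estimates

def hash_join(build_rows, probe_rows, key):
    """Classic hash join: build a map on the small side, probe with the large side."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    out = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out

def join(left, right, key):
    """Broadcast the small side when it fits; otherwise shuffle both sides by key."""
    if min(len(left), len(right)) <= BROADCAST_LIMIT:
        build, probe = (left, right) if len(left) <= len(right) else (right, left)
        return "broadcast", hash_join(build, probe, key)
    # Shuffle: hash-partition both sides so matching keys land in the same partition
    parts = 4
    lparts = [[] for _ in range(parts)]
    rparts = [[] for _ in range(parts)]
    for row in left:
        lparts[hash(row[key]) % parts].append(row)
    for row in right:
        rparts[hash(row[key]) % parts].append(row)
    out = []
    for lp, rp in zip(lparts, rparts):
        out.extend(hash_join(lp, rp, key))
    return "shuffle", out
```

The point of the sketch: the right strategy depends on data shape, which is why engines that can switch among broadcast, shuffle, and colocated joins at runtime hold up better under product-scale concurrency than a single hard-coded plan.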
5) Faster iteration—without backfill purgatory
Questions evolve. If each change forces a wide-table rebuild, iteration stalls. A sturdier pattern is two-layered:
- Keep detail tables normalized so you can slice by any new dimension immediately.
- Accelerate hot paths with asynchronous materialized views (MVs) for common aggregates. With automatic query rewrite, users get MV speed without losing ad-hoc freedom.
Advertiser and payments teams highlight this combo: standard SQL (joins, subqueries, MVs) lets analysts move without waiting for bespoke pipelines, and async MVs keep aggregates fresh while detail remains live for drill-downs.
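A tiny Python sketch of the two-layer idea, under loud assumptions: the “MV” here is just a dict refreshed by a job, and the “query rewrite” is a manual check. Real engines refresh incrementally and rewrite transparently inside the optimizer, but the division of labor is the same.

```python
import collections

detail = []      # normalized detail rows, always live for ad-hoc drill-downs
mv_clicks = {}   # "materialized view": clicks per campaign, refreshed asynchronously

def refresh_mv():
    """Async refresh job: recompute the aggregate from detail.
    (A real engine would refresh incrementally, not from scratch.)"""
    agg = collections.Counter()
    for row in detail:
        agg[row["campaign"]] += row["clicks"]
    mv_clicks.clear()
    mv_clicks.update(agg)

def clicks_by_campaign(use_rewrite=True):
    """Query rewrite: if the MV covers this query shape, serve it from the MV;
    otherwise fall back to scanning the detail table."""
    if use_rewrite and mv_clicks:
        return dict(mv_clicks)
    agg = collections.Counter()
    for row in detail:
        agg[row["campaign"]] += row["clicks"]
    return dict(agg)
```

Because the detail table stays normalized, a brand-new dimension is queryable immediately via the fallback path; only the hot aggregate pays the (asynchronous) materialization cost.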
Helpful guardrails: Track time to add a metric/dimension, number of backfills per week, and engineer hours per iteration; reserve MVs for queries that consistently dominate workloads.
6) Freshness and accuracy for changing records
Orders update, sessions merge, inventory shifts, account statuses flip. Real-time holds together only if those changes land quickly and don’t turn every read into an expensive merge. Engines designed for this pattern typically use delete-and-insert upserts with a primary-key index so scans stay vectorized and predictable as data churns—enabling frequent updates with fast ad-hoc joins, without nightly rebuilds.
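The delete-and-insert mechanics can be sketched in a few lines of Python. This is a toy model, not any engine’s implementation: rows stand in for columnar segments, a set stands in for roaring-bitmap delete markers, and a dict stands in for the primary-key index.

```python
class PrimaryKeyTable:
    """Toy delete-and-insert upsert: old versions are tombstoned, never rewritten
    in place, so scans stay sequential and need no per-query version merge."""

    def __init__(self):
        self.rows = []           # append-only storage (stand-in for columnar segments)
        self.pk_index = {}       # primary key -> position of the live row version
        self.tombstones = set()  # delete markers (real engines often use bitmaps)

    def upsert(self, key, value):
        if key in self.pk_index:
            self.tombstones.add(self.pk_index[key])  # mark the old version deleted
        self.pk_index[key] = len(self.rows)
        self.rows.append((key, value))

    def delete(self, key):
        pos = self.pk_index.pop(key, None)
        if pos is not None:
            self.tombstones.add(pos)

    def scan(self):
        # Reads simply skip tombstoned positions; no merging of row versions.
        return [row for i, row in enumerate(self.rows) if i not in self.tombstones]

    def get(self, key):
        pos = self.pk_index.get(key)
        return None if pos is None else self.rows[pos][1]
```

Writes are cheap appends plus a marker, and scans stay predictable as data churns; the deferred cost is compaction, which is why tombstone overhead and compaction windows belong on your dashboard.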
Helpful guardrails: Monitor write throughput vs. query latency, p95 freshness during spikes, and storage overhead from tombstones; tune compaction windows so freshness doesn’t drift.
7) One place to ask questions—hot signals and deep history
Analysts and services shouldn’t care where bytes live. The durable pattern is one analytical surface: keep hot, mutable facts in the native store for interactivity; read history directly from the lake (e.g., Apache Iceberg) for long lookbacks; join across when needed.
Teams in Web3 and financial analytics use this to run multi-hop traces and market analytics without copying everything into yet another store; payments teams describe similar plans to federate queries across lakehouse tables and native stores so users don’t care where the data resides—the system routes the query to the best source automatically.
Helpful guardrails: Measure the fraction of cross-tier queries meeting SLA, keep duplication ratios low, and promote slow, high-fanout lake queries into MVs/summary tables when patterns justify it.
How Do You Turn Real-Time Analytics Benefits Into Reality?
To actually realize these benefits, your stack (and habits) should do the following:
- Ingest in seconds and measure freshness as end-to-end processing time—not “events reached Kafka.” This enables mid-campaign or in-flight optimization (as seen in large advertiser stacks).
- Answer join-heavy questions at interactive speed. Rely on a cost-based optimizer, runtime filters, and multiple join strategies; this is core to product-scale dashboards and investigations (e.g., Celonis’s process-mining workloads).
- Support safe mutation without slowing reads. Use a primary-key model with delete-and-insert upserts and efficient delete markers so scans stay vectorized and predictable as data changes.
- Keep details flexible; accelerate hot paths. Keep facts normalized for exploration; add asynchronous MVs for the aggregates everyone relies on—let query rewrite hit them automatically.
- Unify fresh + historical analysis on one surface. Hot signals where queries fly; deep history directly from the lake, joined as needed—common in Web3 tracing and market analytics.
- Prove concurrency with your real patterns. Validate p90/p95 under your peak mix of filters + joins; instrument timeouts; be ready to shift between broadcast/shuffle/colocated joins.
- Retire duplicate “speed layers” as confidence grows. One analytical surface reduces drift, backfills, and cost (payments and risk examples reflect this).
If you want one system that aligns with the checklist above, StarRocks is a strong match: vectorized MPP execution with a cost-based optimizer and runtime filters for low-latency joins and high concurrency; primary-key upserts (delete-and-insert) with efficient delete markers for mutable data; asynchronous materialized views with query rewrite for fast aggregates without wide-table lock-in; and the ability to query lakehouse tables (e.g., Iceberg) alongside hot native storage. These capabilities are why it shows up in production across advertiser analytics (e.g., Pinterest), process mining (Celonis), Web3 investigations, and payments.
FAQ: Real-Time Analytics
1) What does “real-time” actually mean—sub-second everywhere?
No. Set SLOs by business need. Interactive drill-downs and risk gates often need <1–3 s; continuously refreshed ops views can tolerate 3–30 s; high-level overviews may accept micro-batches (30–300 s). Measure end-to-end processing time (event → queryable → result), not just “events reached Kafka.”
2) Do I need a streaming engine to get real-time analytics?
Streaming engines (Flink/Spark/Kafka Streams) compute on flowing data. They’re great for enrichment/windows/alerts and often feed your analytical store. Many modern OLAP engines can also ingest streams directly and compute a lot at query time. Use streaming where you need continuous transforms; use the OLAP layer to serve multi-table, ad-hoc questions fast.
3) Wide denormalized tables vs. joins on the fly—what’s safer?
Wide tables are fast until the schema or questions change; then you’re backfilling and rematerializing. A more durable pattern is flexible detail tables + asynchronous materialized views (MVs) for hot aggregates. With query rewrite, users get MV speed but keep ad-hoc freedom. Payments teams report millisecond-level aggregates on ~30 TB via async MVs while sustaining P95 <1 s.
4) Can columnar OLAP really handle updates/deletes without slowing down? How?
Yes—via primary-key upserts implemented as “delete-and-insert.” Efficient delete markers (e.g., roaring-bitmap tombstones) plus a primary-key index (in-memory or on-disk) let writes land quickly while scans remain vectorized, so reads don’t pay a per-query merge tax. This is why teams can keep mutable signals (balances, statuses) fresh and queryable.
5) How should I validate concurrency for user-facing analytics?
Test with your real query shapes (filters + multi-table joins), not just straight scans. Watch p90/p95 and timeouts during peak. Optimizers that reorder joins, push down predicates, and apply runtime filters matter a lot, especially when dashboards fan out into many subqueries. Celonis cites the need for a strong cost-based optimizer, column histograms, runtime filters, and multiple join strategies (broadcast, shuffle, colocated) to keep complex dashboards usable over tens of billions of rows.
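A minimal load-test harness along these lines can be written with the standard library. The timeout threshold and query counts below are placeholders; plug in your own query function (one that replays real filter + join shapes) and your own SLO.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

TIMEOUT_S = 0.5  # placeholder SLO; set this to your real interactive budget

def run_load_test(query_fn, n_queries=200, concurrency=20):
    """Fire queries concurrently, then report tail latency and timeout rate."""
    latencies = []
    timeouts = 0

    def one(i):
        start = time.perf_counter()
        query_fn(i)  # your replayed query, parameterized by i
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as ex:
        for lat in ex.map(one, range(n_queries)):
            latencies.append(lat)
            if lat > TIMEOUT_S:
                timeouts += 1

    latencies.sort()

    def pct(p):  # nearest-rank percentile over observed latencies
        return latencies[max(0, math.ceil(p / 100 * len(latencies)) - 1)]

    return {"p90": pct(90), "p95": pct(95), "timeout_rate": timeouts / n_queries}
```

The key design choice is replaying your real peak mix through `query_fn` rather than synthetic scans: tail latency under join-heavy concurrency is what decides whether the product feature ships.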
6) How do I keep risk controls as fresh as my attackers?
Aim for second-level ingestion, sub-second point lookups, and thousands of QPS for feature queries. In production risk-feature services, teams report ~1 s ingestion with high-QPS lookups; at DiDi, internal tests showed 20–30 ms lookups on indexed queries and sustained ~1.1k–1.5k QPS on a modest cluster, confirming suitability for real-time gating.
7) How do I collapse the “speed layer” and cut backfills?
Unify fresh and historical analysis on one surface. Keep hot, mutable facts in the analytic engine for interactivity; read deep history directly from the lake (e.g., Iceberg) and join across when needed. Web3 stacks emphasize second-level freshness with complex multi-table joins and the option to query Iceberg directly for history. This reduces duplicate pipelines and reconciliation work.
8) What metrics should I instrument to know “it’s working”?
Track end-to-end freshness (p50/p95), query tail latency (p90/p95), timeout rate, cost per 1,000 queries, pipelines-per-metric (aim for one), backfills/week, and “time to add a metric/dimension.” In risk/ops, also watch decision latency and false-positive/negative rates before/after going fresher.
9) Real examples—what improvements do teams actually see?
- Pinterest (advertiser analytics): p90 latency down ~50%, same analytics on ~32% of prior instances, ~10-second freshness after migration—freeing teams to optimize mid-campaign.
- Celonis (process mining): dashboards over tens of billions of rows stay usable thanks to join reordering, runtime filters, and colocated joins.
- VBill (payments): P95 <1 s and P98 <3 s with async MVs; materialized aggregates over ~30 TB; redundant storage cut ~20%; concurrency scaled from lag at 10 to smooth at 100+ concurrent queries.
10) Where does the data lake fit—won’t that be slow?
Keep “hot” operational facts in the analytical engine for sub-second interactivity; push long lookbacks to the lake with pruning and predicate pushdown. Some engines can query Iceberg/Delta/Hudi directly, letting you balance cost and latency without copying everything into another store.
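Pruning is what keeps lake lookbacks affordable: table formats like Iceberg keep per-partition min/max statistics, so the engine can skip partitions whose range cannot match the predicate before reading a byte. A simplified Python sketch of that idea, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class Partition:
    path: str       # hypothetical partition location
    min_day: str    # per-partition min/max stats, as lake table formats maintain
    max_day: str

def prune(partitions, day_from, day_to):
    """Keep only partitions whose [min_day, max_day] range can overlap the
    predicate day_from <= day <= day_to; the rest are never read."""
    return [
        p for p in partitions
        if not (p.max_day < day_from or p.min_day > day_to)
    ]
```

(ISO date strings compare correctly lexicographically, which keeps the sketch short.) The same overlap test generalizes to any column with min/max stats, and it is the reason a “slow” lake scan over months of history can still come back in interactive time when the predicate is selective.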
11) Is real-time analytics only for dashboards?
No—teams increasingly expose analytics as a product feature (APIs, advertiser tools, fraud checks). That requires concurrency and consistent tail latency under load, not just one analyst’s query. Pinterest’s Partner Insights is a concrete example of advertiser-facing analytics running at scale with rich SQL (joins, subqueries, materialized views).
12) If we want a single system that matches these requirements, what should we look for?
You want: (1) second-level ingestion; (2) interactive joins at scale via a strong optimizer, predicate pushdown, runtime filters, and multiple join strategies; (3) primary-key upserts with efficient delete markers so mutable facts don’t slow scans; (4) asynchronous MVs with query rewrite; and (5) the ability to query lakehouse tables alongside hot storage. Systems that check those boxes are showing up in production across advertiser analytics, process mining, Web3 investigations, and payments.
