
Picture this: It's 2:47 PM EST, and the Fed just announced an unexpected rate cut. Within seconds, your trading desk is flooded with order flow as algorithms and traders react to the news. But your risk dashboard still shows yesterday's positions, your P&L calculations are running on 15-minute-old prices, and your traders are making million-dollar decisions based on stale data.
This scenario plays out daily across financial institutions. While markets move in milliseconds, most data architectures move in hours. Traditional batch-oriented warehouses, stitched together with overnight ETL jobs and periodic cache refreshes, simply cannot deliver the real-time insights that modern trading demands. The result? Blind spots in risk exposure, delayed P&L recognition, missed arbitrage opportunities, and regulatory reporting gaps that could prove costly.
If you're a data engineer at a financial institution or fintech struggling with real-time position tracking, a quantitative analyst frustrated by stale market data feeds, or a head of trading technology looking to modernize legacy risk systems, this article is for you.
The fundamental problem isn't just speed. It’s the architectural mismatch between how markets operate (continuous, streaming, interconnected) and how most financial data systems work (batch, periodic, siloed). Every trade impacts portfolio exposure instantly, yet traditional systems require hours of ETL processing before that impact becomes visible in dashboards and risk models.
In this article we'll explore strategies for building a modern financial data warehouse designed for real-time analytics and business intelligence, specifically from the perspective of equities trading. We'll cover high-level use cases and then walk through a simple stock demo powered by StarRocks, showing how streaming data, materialized views, and external catalogs enable live portfolio and P&L aggregations. The full demo is available at the GitHub link at the end. One note before we start: the catalog and database names from the demo have been omitted from the snippets in this article to reduce clutter, but they appear in the repository where needed.
High-Level Strategy
Financial institutions depend on turning raw market data into actionable insights at speed and scale. Whether it is a trading desk monitoring exposure, a portfolio manager recalculating P&L, or a compliance officer reviewing activity, the ability to work with accurate and up-to-the-second data is no longer optional.
Markets move in seconds, not hours, so trades and prices must be ingested and available for querying the moment they arrive. Every transaction impacts a portfolio's exposure and unrealized P&L, which means risk metrics and positions must be recalculated continuously. Delays can leave traders and managers blind to sudden shifts or concentration risks.
Analytics also requires context. Real-time trades and prices alone are not enough without reference data such as security details, books and desks, or trader assignments. A modern data warehouse must unify these sources so that analytics reflect both the latest market events and the organizational structures around them.
Finally, data must be accessible and trustworthy. Traders, analysts, and managers expect insights delivered through BI tools like Tableau, Looker, Metabase, or Grafana, or through internal dashboards. They should not have to wrestle with raw schemas or complex queries to answer a basic business question. At the same time, regulators and executives need to trust the data's timeliness, which means latency and freshness must be measurable and visible.
Together, these requirements define the foundation for a modern financial data warehouse: real-time ingestion, continuous portfolio monitoring, unified reference data, easy accessibility, and transparent governance. With this background, let's look at how these pieces tie together architecturally.
From Strategy to Architecture
Suppose we have a normalized data model like the one described below. Those in front-office data engineering or trading will be familiar with the context of these tables.
The heart of your securities knowledge is reference data, or refdata for short. Reference data is the meat and potatoes that brings context to the individual instruments being bought and sold. Fields like isin and cusip provide universal identifiers recognized domestically and internationally, beyond tickers like "AAPL" or "MSFT", while security descriptions and exchange or currency information are vital for pricing, settlement, and valuation.
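The exact shape of refdata varies from firm to firm, but a minimal sketch built around the fields just mentioned might look like the following (the column names and types here are illustrative, not the demo's actual schema):
-- Hypothetical reference data table; columns are illustrative.
CREATE TABLE refdata (
    symbol      STRING,      -- ticker, e.g. "AAPL"
    isin        STRING,      -- international identifier
    cusip       STRING,      -- North American identifier
    description STRING,      -- human-readable security description
    exchange    STRING,      -- listing exchange
    currency    STRING       -- trading and settlement currency
)
DUPLICATE KEY (symbol)
DISTRIBUTED BY HASH(symbol);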
The next layer of the model revolves around trades. Each trade record captures the essential facts of execution: which symbol was transacted, at what price, in what quantity, and whether it was a buy or sell. Additional metadata like order IDs, fill numbers, and timestamps ensure traceability, reconciliation, and support for regulatory reporting. In practice, this table becomes one of the highest-velocity sources of data in the system, and one that downstream analytics rely on for accurate, timely insights into portfolio positions and risk.
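A rough sketch of such a trades table is below; the column list matches the Routine Load job shown later in this article, while the types and keys are assumptions for illustration:
-- Hypothetical trades table; types and keys are assumed for illustration.
CREATE TABLE trades (
    trade_id        STRING,
    order_id        STRING,
    book_id         STRING,
    symbol          STRING,
    trade_timestamp DATETIME,
    side            STRING,          -- 'BUY' or 'SELL'
    price           DECIMAL(18,8),
    quantity        BIGINT,
    fill_number     INT,
    total_fills     INT,
    fills_remaining INT
)
DUPLICATE KEY (trade_id)
PARTITION BY date_trunc('DAY', trade_timestamp)
DISTRIBUTED BY HASH(symbol);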
Trades, however, don't exist in isolation. They are grouped and managed through books, which act as logical containers representing strategies, desks, or portfolios. Books provide the financial lens through which trades are aggregated, P&L is measured, and risk exposures are evaluated. Sitting alongside books are the traders themselves, who are associated with books via a booktraders relationship that captures roles and accountability. This linkage makes it possible to slice performance not just by instrument or desk, but down to individual decision-makers.
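In the demo these organizational tables actually live in the data lake and are reached through an external catalog (covered later), but their logical shape is roughly the following. Column names mirror the P&L view later in the article; the role column is an assumed example:
-- Hypothetical dimension tables; columns mirror those used in the P&L view below.
CREATE TABLE books (
    book_id   STRING,
    book_name STRING
)
DUPLICATE KEY (book_id)
DISTRIBUTED BY HASH(book_id);

CREATE TABLE traders (
    trader_id STRING,
    name      STRING
)
DUPLICATE KEY (trader_id)
DISTRIBUTED BY HASH(trader_id);

CREATE TABLE booktraders (
    book_id   STRING,
    trader_id STRING,
    role      STRING       -- e.g. primary vs. backup coverage (assumed)
)
DUPLICATE KEY (book_id)
DISTRIBUTED BY HASH(book_id);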
Complementing trades and books is the prices table, which provides the market lens. While trades record what was executed, prices tell us the current valuation environment. By continuously updating the latest quotes for each symbol, this table fuels real-time dashboards, mark-to-market valuations, and P&L calculations. Without fresh prices, even perfectly reconciled trades are blind to what the portfolio is worth today.
Taken together, reference data, trades, books, traders, and prices represent a microcosm of how financial institutions organize and interpret their data. The model is normalized to reduce redundancy, but more importantly, it encodes the relationships that matter: who traded what, in which book, at what price, and what it's worth now. This is the foundation on which the rest of the architecture builds, and the place where strategy starts becoming operational reality.
Analytical Layer
While the schema provides the foundation for storing and organizing trade, price, and reference data, the real value emerges when we transform this raw data into actionable insights. In financial services, this often means combining high-frequency trade data with slower-moving reference data, then surfacing it in dashboards or risk models that can guide decision-making in real time.
StarRocks is designed for this analytical layer. Its architecture enables low-latency queries on streaming and historical data simultaneously, which is essential when you need to ask questions like:
- What is the real-time P&L across my trading books?
- Which trades are driving risk exposure under current market conditions?
- How fresh is the market data that my dashboards are displaying?
So how can we pair this normalized data model with the feature-rich ecosystem of StarRocks?
Next-Generation Features That Bring Data Practitioners Success
StarRocks brings several key capabilities that make it particularly well-suited for financial data workloads. Let's explore how each feature addresses the specific challenges of modern trading environments.
Data Modeling and Table Maintenance
When you’re dealing with market data, you quickly realize that not all tables are created equal. Prices move fast, and depending on whether you care about every tick or just the latest snapshot, your schema design will look very different. StarRocks gives us a flexible toolkit with different table types, so we can tailor our storage strategy to the workload at hand.
For our use case, we’re ingesting a real-time firehose of pricing data. Every tick matters because downstream consumers will be charting, aggregating, and analyzing at a very granular level. That’s why we went with a Duplicate Key table.
CREATE TABLE prices (
    symbol STRING,
    price_timestamp DATETIME,
    price DECIMAL(18,8)
)
DUPLICATE KEY (symbol, price_timestamp)
PARTITION BY date_trunc('DAY', price_timestamp)
DISTRIBUTED BY HASH(symbol) BUCKETS 8
PROPERTIES (
    "colocate_with" = "symbol_group"
);
Here’s the logic: the duplicate key definition lets us record every tick that comes in, even if two prices land at the exact same timestamp. By choosing (symbol, price_timestamp) as the compound key, we get a natural sort order that makes time-based filters efficient. Partitioning by day keeps queries snappy. If you only care about today’s trading session, StarRocks can prune away yesterday’s data entirely. Bucketing by symbol helps distribute the load evenly, and by colocating symbol-based tables, we make joins cheap and local. This design keeps our data high-fidelity and our queries fast.
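For example, a dashboard query over today's session only touches today's partition and rides the (symbol, price_timestamp) sort order. The symbol below is just an illustration:
-- Illustrative query: today's ticks for one symbol; older daily partitions are pruned.
SELECT price_timestamp, price
FROM prices
WHERE symbol = 'AAPL'
  AND price_timestamp >= CURRENT_DATE()
ORDER BY price_timestamp;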
But that’s just one path.
If instead we only cared about a snapshot of the latest price, we’d reach for a Primary Key table.
CREATE TABLE prices (
    symbol STRING,          -- primary key columns are listed first,
    price_timestamp DATETIME,
    seqId STRING,
    price DECIMAL(18,8)     -- followed by the value columns
)
PRIMARY KEY(symbol, price_timestamp, seqId)
DISTRIBUTED BY HASH(symbol);
With a primary key, you get uniqueness enforcement. That means you can choose:
- (symbol) as the key: keep just the most recent price per symbol, overwriting old ones.
- (symbol, price_timestamp, seqId) as the key: store multiple ticks at the same timestamp while still ensuring uniqueness.
The former is great for dashboards or applications where you just need “what’s the price now?” The latter is a middle ground where you keep some history but still constrain duplicates.
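For the snapshot-only variant, a minimal sketch (assuming all we want is the latest price per symbol; the table name here is hypothetical) would look like:
-- Hypothetical latest-price snapshot: each new tick for a symbol replaces the previous row.
CREATE TABLE latest_prices (
    symbol STRING,
    price_timestamp DATETIME,
    price DECIMAL(18,8)
)
PRIMARY KEY(symbol)
DISTRIBUTED BY HASH(symbol);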
So which one’s right?
- Duplicate Key: best for time-series analysis and retaining every tick.
- Primary Key: best for latest-state snapshots or update-heavy workloads that also need fast reads.
In our case, the need for granular historical data made Duplicate Key the clear winner. But knowing that we can pivot to Primary Key tables if the use case changes gives us flexibility in data architecture.
Asynchronous Materialized Views
One of StarRocks’ most powerful features for financial analytics is the ability to create asynchronous materialized views. These views act like pre-computed shortcuts: instead of repeatedly running the same heavy query and consuming system resources, we can store the result set and have it refresh automatically as new data arrives. The payoff is huge, as sub-second queries on complex aggregations that would normally scan millions of rows.
Take the challenge of calculating a volume-weighted average price (VWAP) over the buys in a portfolio (excluding short positions to keep values positive).
On paper it looks simple, but in practice you might be aggregating across millions of trades spread over multiple books and symbols over and over again.
Doing that in real time on raw tables is expensive!
That’s where our first materialized view comes in:
CREATE MATERIALIZED VIEW positions_base_mv
PARTITION BY symbol
REFRESH ASYNC
AS
SELECT
    t.book_id,
    t.symbol,
    SUM(CASE WHEN t.side = 'BUY' THEN t.quantity
             WHEN t.side = 'SELL' THEN -t.quantity END) AS net_quantity,
    SUM(CASE WHEN t.side = 'BUY' THEN t.quantity * t.price END)
        / NULLIF(SUM(CASE WHEN t.side = 'BUY' THEN t.quantity END), 0) AS avg_price
FROM trades t
GROUP BY t.book_id, t.symbol
HAVING SUM(CASE WHEN t.side = 'BUY' THEN t.quantity
                WHEN t.side = 'SELL' THEN -t.quantity END) > 0;
The materialized view can refresh as often as we, the data practitioners, would like. With the current strategy, StarRocks refreshes the view whenever it detects a change in the base table. A near-real-time setup doesn't have to behave this way, especially if a tool like Tableau or Grafana already polls on its own preset interval of, say, 60 seconds. In that case, we could swap in a scheduled refresh policy like this:
REFRESH ASYNC START("2025-XX-XX 10:00:00") EVERY (INTERVAL 1 MINUTE)
This view reduces all trade-level data into per-book, per-symbol positions. The asynchronous refresh ensures we're not re-computing the world on every query, while still keeping the numbers timely. Keep in mind that refresh cost also depends on partition size, so it's worth monitoring your partition retention policies as data volumes change over time.
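While tuning the policy, you can also trigger or inspect refreshes yourself. On recent StarRocks versions, for example:
-- Kick off an immediate refresh of the view.
REFRESH MATERIALIZED VIEW positions_base_mv;

-- Inspect state, last refresh time, and errors for views in the current database.
SHOW MATERIALIZED VIEWS;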
We can then layer on top of this with a second materialized view that joins positions with live prices:
CREATE MATERIALIZED VIEW positions_with_price_mv
DISTRIBUTED BY HASH(symbol) BUCKETS 8
REFRESH ASYNC
PROPERTIES ( "colocate_with" = "symbol_group" )
AS
SELECT
    p.book_id,
    p.symbol,
    p.net_quantity,
    p.avg_price,
    pr.price AS current_price
FROM positions_base_mv p
LEFT JOIN prices pr ON p.symbol = pr.symbol;   -- joining on symbol keeps the join colocated
Because both tables share the same bucketing strategy and colocation group, the join stays local, with no data shuffled across nodes. This two-tier design gives us fast, incremental refreshes: positions update when trades happen, and prices update independently as market ticks roll in.
The result? We avoid heavyweight joins over raw data, keep refresh cycles lightweight, and still meet the real-time SLAs demanded by financial BI and risk systems.
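A dashboard or ad hoc query can then read straight from the pre-joined view. For instance, a rough unrealized P&L per book, using the same formula as the P&L view defined later:
-- Illustrative query: unrealized P&L aggregated per book.
SELECT
    book_id,
    SUM((current_price - avg_price) * net_quantity) AS unrealized_pnl
FROM positions_with_price_mv
GROUP BY book_id
ORDER BY unrealized_pnl DESC;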
External Catalog and Open Table Format Support
Financial institutions often have reference data managed in separate systems such as data lakes, data warehouses, or specialized vendor feeds. StarRocks' External Catalog feature lets you query this data in place, out of the box, without building ETL pipelines.
In our demo, reference data, books, and trader information are stored in Iceberg table format behind an AWS Glue catalog, but you could use any of the catalog types StarRocks supports, such as Hive, Hudi, Delta Lake, Paimon, or JDBC sources, coupled with other metastores like Hive Metastore or Tabular.
CREATE EXTERNAL CATALOG iceberg_catalog_glue
PROPERTIES
(
"type" = "iceberg",
"iceberg.catalog.type" = "glue",
"aws.glue.use_instance_profile" = "true",
"aws.glue.region" = "<insert-region-here>",
"aws.s3.use_instance_profile" = "true",
"aws.s3.region" = "<insert-region-here>"
);
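Once the catalog is created, it can be browsed and queried with ordinary SQL. For example (database names omitted here, following the same convention as the rest of this article):
-- List databases exposed through the external catalog.
SHOW DATABASES FROM iceberg_catalog_glue;

-- Query Iceberg reference data in place.
SELECT symbol, description
FROM iceberg_catalog_glue.refdata
LIMIT 10;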
This lets us create a unified view that combines real-time trading data with static reference data:
CREATE MATERIALIZED VIEW positions_pnl_view
DISTRIBUTED BY HASH(symbol) BUCKETS 8
REFRESH ASYNC
AS
SELECT
p.book_id,
b.book_name,
bt.trader_id,
t.name AS trader_name,
p.symbol,
r.description AS security_description,
p.net_quantity,
p.avg_price,
p.current_price,
(p.current_price - p.avg_price) * p.net_quantity AS unrealized_pnl
FROM positions_with_price_mv p
LEFT JOIN iceberg_catalog_glue.refdata r ON p.symbol = r.symbol
LEFT JOIN iceberg_catalog_glue.books b ON p.book_id = b.book_id
LEFT JOIN iceberg_catalog_glue.booktraders bt ON p.book_id = bt.book_id
LEFT JOIN iceberg_catalog_glue.traders t ON bt.trader_id = t.trader_id;
The security master (secmaster) and organizational data can live upstream in your existing data lake, while real-time analytics happen in StarRocks. No data movement required.
Ingestion Capabilities for Kafka via Routine Load
In tick-level use cases today, data typically arrives via streaming platforms like Kafka or licensed distributed messaging buses. StarRocks' Routine Load feature provides native Kafka integration that continuously ingests streaming data with exactly-once semantics (Flink pipelines are supported separately through the Flink connector).
CREATE ROUTINE LOAD trades_routine_load ON trades
COLUMNS(trade_id, order_id, book_id, symbol, trade_timestamp, side, price, quantity, fill_number, total_fills, fills_remaining)
PROPERTIES
(
    "desired_concurrent_number" = "3",
    "format" = "json"
)
FROM KAFKA
(
    "kafka_broker_list" = "<host>:9092",
    "kafka_partitions" = "[0,1,2,3...X]",
    "kafka_topic" = "<insert-topic-here>",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
Routine Load handles backpressure, retries, and schema evolution automatically, so data becomes queryable within seconds of arriving in Kafka. With tuning parameters such as task concurrency and partition mapping across a topic, you can effectively parallelize your streaming workloads for feeds like trades and pricing.
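Operationally, the job can be monitored and controlled with a few standard statements (the job name matches the one created above):
-- Check job state, consumption progress, and error rows for the trades load job.
SHOW ROUTINE LOAD FOR trades_routine_load;

-- Pause and resume the job, for example around maintenance windows.
PAUSE ROUTINE LOAD FOR trades_routine_load;
RESUME ROUTINE LOAD FOR trades_routine_load;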
Legacy architectures require you to design complex streaming pipelines and bolt on backend integrations even for business intelligence use cases. When your end users want to talk to your data directly, StarRocks lets you remove the unnecessary middleware that gets in the way of dashboarding and ad hoc workflows.
The Sum of Parts
Putting these pieces together, the architecture works as follows: Kafka streams feed trades and prices into StarRocks through Routine Load, asynchronous materialized views maintain positions and P&L, external catalogs pull reference data in from the lake, and BI tools query the results directly.
Native Integration with Business Intelligence Tools
Lastly, StarRocks provides native connectors for popular BI tools, ensuring that your financial data can power existing dashboards and reporting workflows. In our demo architecture, we used Grafana to visualize real-time P&L and position data, but StarRocks equally supports:
- Apache Superset for open-source business intelligence
- Tableau for enterprise visualization and self-service analytics
- Metabase for rapid dashboard creation
- Power BI for Microsoft-centric environments
The key advantage is that these tools can query StarRocks directly using standard SQL, without specialized APIs or custom connectors. Analysts can build dashboards in familiar tools while automatically benefiting from StarRocks' real-time data freshness and query performance.
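As a concrete example of the kind of query a dashboard panel might run, here is a sketch of a data-freshness check against the prices table from earlier; how the result is wired into a panel or alert is up to the BI tool:
-- Illustrative freshness check: how stale is the latest tick per symbol?
SELECT
    symbol,
    MAX(price_timestamp) AS last_tick,
    TIMESTAMPDIFF(SECOND, MAX(price_timestamp), NOW()) AS seconds_stale
FROM prices
GROUP BY symbol
ORDER BY seconds_stale DESC;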
Try for Yourself
Want to see these concepts in action? We've created a lightweight market simulation demo that showcases all the StarRocks features discussed in this article on CelerData BYOC. The demo includes:
- Market data simulator generating realistic trades and price movements
- Kafka streams feeding data into StarRocks via Routine Load
- Materialized views calculating real-time positions and P&L
- External catalogs integrating with AWS Glue for reference data
- Grafana dashboards visualizing live trading metrics
https://github.com/ronkapoor86/trading-demo
The demo is perfect for exploring how StarRocks can transform your financial data architecture without complex infrastructure setup, and it lets you experiment in an easy-to-understand sandbox that can scale.
If you have any questions or want to be a part of the community that’s changing real time analytics, join the StarRocks community on Slack → starrocks.io/slack.