Our February edition is a mix of production wins, engine internals, and community deep dives—with a look at where things are heading: sub-second analytics serving both end users and AI agents at scale—from the same engine.
This month: one team shares how they rebuilt their data platform on CelerData Cloud to power both interactive dashboards and LLM-driven workloads—including their MCP server + OpenAI integration in production. Others walk through unifying fragmented Trino and ClickHouse stacks, hitting sub-second on Iceberg with materialized views, and building governed analytics with dbt. Plus a deep dive on vectorized execution, a new Snowflake Horizon integration, and community posts on federation and fast joins.
Whether you're rethinking your data infrastructure or exploring what it takes to make analytics agent-ready, this session from Dominik Lange and Uday Rajanna at Conductor is worth your time — a genuine thank you to both for the depth and honesty they brought to it.
Dominik covered the full data infrastructure rebuild that got Conductor to sub-second query performance on large datasets with CelerData Cloud. Uday broke down the agentic layer — MCP Server, OpenAI integration, and the split reasoning architecture that keeps LLMs focused on intent while the data API handles the precision work.
And don't miss the live demo of the Claude integration at the end — an autonomous agent running a full AI search audit in real time!
No shortcuts, no hand-waving — just the full picture!
If you're building analytics into your product—or evaluating real-time cloud data warehouses for customer-facing workloads—this session breaks down what actually matters: interactive performance, multi-table analysis, data freshness, and operational simplicity. Includes a side-by-side comparison of ClickHouse Cloud and CelerData Cloud with a live demo and a real-world production use case.
For teams governing data in Snowflake, CelerData Cloud now integrates with Snowflake Horizon Catalog via the standard Iceberg REST Catalog protocol. You can run sub-second, high-concurrency analytics directly on your Snowflake-governed Iceberg data—no data copying, no separate serving layer. The post walks through the architecture, the multi-layer caching strategy, and what setup looks like in practice.
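To make the setup concrete, here is a hedged sketch of what registering a REST-based Iceberg catalog looks like, modeled on StarRocks' documented `CREATE EXTERNAL CATALOG` syntax for Iceberg. The catalog name, endpoint URL, and warehouse value below are placeholders, not the exact values from the post, and the Horizon-specific auth properties are omitted.

```sql
-- Hypothetical config sketch: connect to a Snowflake-governed Iceberg
-- catalog over the standard Iceberg REST Catalog protocol.
-- Placeholders (<...>) must be replaced with your account's values.
CREATE EXTERNAL CATALOG snowflake_horizon
PROPERTIES (
    "type" = "iceberg",
    "iceberg.catalog.type" = "rest",
    "iceberg.catalog.uri" = "https://<account>.snowflakecomputing.com/<rest-catalog-path>",
    "iceberg.catalog.warehouse" = "<horizon-warehouse>"
);
```

Because the protocol is the open Iceberg REST spec rather than a proprietary connector, the same governed tables stay queryable from any REST-capable engine with no copies.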
SmartNews ran ClickHouse for customer-facing advertiser analytics (p95 latency ~100ms, 3TB ingested daily, ~20TB managed) and Trino for ad-hoc queries and ML feature ETL. Maintaining both was costly and complex. After benchmarking against production workloads, they chose CelerData Cloud as a single replacement—achieving 3.6x faster ad-hoc query performance, stable sub-second latency at 800+ QPS, and efficient real-time joins without denormalization. Dennis Zhao walks through the evaluation, the results, and what's next: migrating storage from Hive to Apache Iceberg.
Want to see CelerData Cloud in action? Try it free for 30 days—no commitment required. Spin up your own environment and put it to the test against your real workloads!
Vectorization gets talked about a lot, but the implementation details matter. Kaisen Kang (StarRocks TSC Member, Query Engine & AI Agent Team Lead) walks through how StarRocks uses CPU SIMD instructions to process multiple data elements in parallel, and why true database vectorization goes well beyond enabling a hardware feature. A good read if you want to understand what's actually happening under the hood when queries return in under a second.
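As a rough analogy for the idea in the post (StarRocks' engine is C++ using SIMD intrinsics; this is only an illustration, not its implementation), compare processing one row at a time with operating on a whole column batch at once, as NumPy's vectorized kernels do:

```python
import numpy as np

def scalar_filter_sum(prices, quantities, threshold):
    """Row-at-a-time: one branch and one multiply per row."""
    total = 0.0
    for p, q in zip(prices, quantities):
        if p > threshold:
            total += p * q
    return total

def vectorized_filter_sum(prices, quantities, threshold):
    """Columnar: the comparison, multiply, and sum each run over the
    whole batch, letting the CPU fill SIMD lanes and skip per-row branches."""
    mask = prices > threshold            # one vectorized comparison
    return float(np.sum(prices[mask] * quantities[mask]))

prices = np.array([5.0, 12.0, 7.5, 20.0])
quantities = np.array([2.0, 1.0, 4.0, 3.0])
print(scalar_filter_sum(prices, quantities, 10.0))      # 72.0
print(vectorized_filter_sum(prices, quantities, 10.0))  # 72.0
```

The batch version is where the "well beyond enabling a hardware feature" point bites: the engine has to keep data columnar end to end, or the per-row branching creeps back in between operators.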
Jacky Wu (dbt-starrocks contributor, Senior Enterprise Solution Manager at SJM Resorts) lays out how to unify data modeling, automation, and analytics into a single framework using dbt, StarRocks, and DataOps practices. The post covers dbt's role in governance automation, how DataOps improves iteration speed and control, and includes real-world case studies showing how the approach works in production for both real-time and batch scenarios.
Simon Späti put together an in-depth look at why StarRocks is gaining traction in the real-time analytics space—including interviews with Eric Sun and Anton Borisov, and production details from Coinbase, Pinterest, and Fresha. Key takeaways: joins are consistently the differentiator (Coinbase's TPC-H 1TB benchmark saw ClickHouse fail 12 of 22 queries), colocated joins are surprisingly simple in concept, Pinterest cut p90 query latency by 50% while running on 32% of their previous Druid infrastructure, and cold S3 data still returns in 3–5 seconds when Iceberg metadata is well-sorted. A practical, balanced deep dive—including the trade-offs and when other tools might still fit.
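The "surprisingly simple in concept" claim about colocated joins can be sketched in a few lines. This is a hypothetical toy model, not StarRocks internals: if both tables are hash-partitioned on the join key with the same scheme, matching keys always land on the same node, so each node joins its own partitions with no network shuffle.

```python
NUM_NODES = 3

def partition(rows, key_index):
    """Assign each row to a node by hashing its join key."""
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[hash(row[key_index]) % NUM_NODES].append(row)
    return nodes

def local_hash_join(left_part, right_part):
    """Join one node's partitions; no data moves between nodes."""
    index = {}
    for row in left_part:
        index.setdefault(row[0], []).append(row)
    return [l + r for r in right_part for l in index.get(r[0], [])]

orders = [("u1", "order-1"), ("u2", "order-2"), ("u1", "order-3")]
users  = [("u1", "Alice"), ("u2", "Bob")]

# Same partitioning function on both sides guarantees matching keys colocate.
left_parts, right_parts = partition(orders, 0), partition(users, 0)
joined = [row for i in range(NUM_NODES)
          for row in local_hash_join(left_parts[i], right_parts[i])]
print(sorted(joined))
```

The hard part in a real engine is keeping that colocation invariant through ingestion, rebalancing, and replica placement, which is why "simple in concept" is the operative phrase.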
NAVER Corp Commerce, one of South Korea's leading e-commerce platforms, needed sub-second analytics on real-time transactional data—~15 analytical dimensions, ~13 metrics, dynamic and unpredictable query patterns, and 7 weeks of historical comparison. 홍남춘 (Namchun Hong) shares how the team built a low-latency platform using Apache Iceberg for storage, StarRocks external catalog with aggressive metadata caching, and StarRocks Materialized Views for pre-aggregated queries. The results speak for themselves: Trino on Iceberg took ~1 minute; StarRocks MVs returned in under 1 second. 90% of production dashboard queries now return sub-second.
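The minute-to-sub-second gap comes from doing the heavy aggregation once, at refresh time, instead of on every query. A hypothetical illustration of the pre-aggregation pattern (not NAVER's actual schema or StarRocks' MV machinery):

```python
from collections import defaultdict

raw_events = [
    # (date, category, region, sales) -- stand-in for raw transactional rows
    ("2026-02-01", "books", "KR", 100),
    ("2026-02-01", "books", "KR", 50),
    ("2026-02-01", "toys",  "KR", 30),
    ("2026-02-02", "books", "JP", 70),
]

def refresh_mv(events):
    """Refresh step: the expensive GROUP BY over raw events runs once here."""
    mv = defaultdict(int)
    for date, category, _region, sales in events:
        mv[(date, category)] += sales
    return dict(mv)

mv = refresh_mv(raw_events)

def query_sales(date, category):
    """Dashboard query: a lookup against the rollup, not a scan of raw rows."""
    return mv.get((date, category), 0)

print(query_sales("2026-02-01", "books"))  # 150
```

The trade-off is the usual one: each distinct dimension combination you pre-aggregate costs refresh work and storage, which is why unpredictable query patterns (as in NAVER's case) make MV design and metadata caching the interesting part of the story.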
Nicoleta Lazar from Fresha digs into query federation: what it is, why it matters for modern OLAP workloads, and how StarRocks approaches it differently from Trino. The post covers the vectorized execution engine, native connectors, deep Apache Iceberg integration, and real-world challenges like schema evolution, file fragmentation, and object-storage latency. She also walks through Fresha's hot/cold data separation strategy and federating additional sources like Elasticsearch, PostgreSQL, and Apache Paimon into a single analytical layer.
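The core shape of federation is easy to miniaturize. The sketch below is a toy model under obvious assumptions (in-memory lists standing in for real connectors): one query layer fans out to heterogeneous sources through a common interface, pushes the filter down to each source, and unions the results.

```python
class ListSource:
    """Stand-in for a native connector (e.g. PostgreSQL, Elasticsearch)."""
    def __init__(self, name, rows):
        self.name, self.rows = name, rows

    def scan(self, predicate):
        # Predicate pushdown: each source filters its own data locally,
        # so only matching rows cross the wire.
        return [r for r in self.rows if predicate(r)]

def federated_query(sources, predicate):
    """Single analytical layer: union filtered rows from every source."""
    out = []
    for src in sources:
        out.extend((src.name, *row) for row in src.scan(predicate))
    return out

hot = ListSource("starrocks_native", [("2026-02-27", 120), ("2026-02-28", 90)])
cold = ListSource("iceberg_s3", [("2025-12-01", 40), ("2026-01-15", 95)])

# e.g. all days with sales over 80, regardless of where the data lives
rows = federated_query([hot, cold], lambda r: r[1] > 80)
print(sorted(rows))
```

This also shows why the hot/cold split Nicoleta describes matters: the query layer doesn't change, only which source a given time range is served from.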
Still getting traction from last month—Jesús Gómez-Escalonilla Guijarro (Fresha) shares a clean, repeatable approach to tuning StarRocks queries: start with scans and joins, use plans + profiles to confirm what's really happening, and iterate from there. Plus a shoutout to the Fresha Data Engineering team for building Northstar (open-source)—a plan/profile visualizer that makes bottlenecks easier to spot and improvements easier to validate.
Kaisen Kang (Head of Query & Agent Team, CelerData) shares the 10 core engine capabilities needed to power AI data agents in production—with real examples from StarRocks. Plus talks from Altinity, Grafana Labs, Fivetran, and PostHog on Iceberg, data lake visualization, and AI-ready context.
📍 Austin (Mar 10): Register here 📍 San Francisco (Mar 12): Register here
Confluent's Data Streaming World Tour hits two cities. Seattle (Mar 17) is an AI-focused day with sessions on agentic AI use cases, context engineering, MCP, Agent2Agent, streaming agents on Flink, and a hands-on multi-agent workshop. Jersey City (Mar 26) covers production streaming architectures, lakehouse patterns, and how to build AI agents that ingest, process, and act on streaming data in real time. Both are free, with breakfast and lunch included. Space is limited.
📍 Seattle (Mar 17): Register here 📍 Jersey City (Mar 26): Register here
As a Gold Sponsor of the Iceberg Summit this year, we'll be at the Marriott Marquis in San Francisco on April 8–9. If you're building on Apache Iceberg or working in modern data analytics, this is a must-attend event. We'd love to meet you on the expo floor to talk architecture, swap tuning tricks, or just say hi!
StarRocks Award recipients have been receiving their trophies, and we've already spotted a few on social media. We're here for it. Tag StarRocks or CelerData in your trophy photo — we want to see where Rocky is living now!
If you made it this far, you deserve a trophy as well! 🏆
A quick ask before you go: with so much of the industry shifting toward agentic workflows and real-time context, we want to know what you are building. If you're spinning up a CelerData Cloud trial or pushing our products to their limits with LLMs, let us know what's working, what's surprising you, and where you need more horsepower. The best engine upgrades always come from your toughest production constraints.
And if you've got a story, a tuning trick for feeding context to agents, or a "this took us way too long to figure out" lesson, send it our way. We're always looking to highlight the most useful production learnings from the community.
Here's to a month of sub-second queries, seamless joins, and analytics that actually keep up with your users and your agents. 💙