The ultimate goal of real-time analytics is to minimize the time to actionable insight. Two factors determine that time: query speed and data freshness. The faster data can be accessed and analyzed, the quicker decisions can be made. Data engineers currently rely on expensive data pipelines to deliver fresh, real-time insights. This blog explains why those pipelines hurt your business and how to solve the problem.
Data pipelines use processes such as denormalization and preaggregation to accelerate data retrieval. Flattening data on ingestion makes large queries cheaper at retrieval time, because the joins and aggregations have already been done. These transformations are tedious for batch analytics, but they are a larger stumbling block for real-time analytics, where data changes continuously and the pipelines themselves are more complex.
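To make the two transformations concrete, here is a minimal sketch in plain Python. The `USERS` and `ORDERS` data and both function names are hypothetical, made up for illustration; a real pipeline would run this logic in a stream or batch processor rather than in memory.

```python
from collections import defaultdict

# Hypothetical source data: a normalized orders feed and a users lookup.
USERS = {1: {"country": "US"}, 2: {"country": "DE"}}
ORDERS = [
    {"order_id": 10, "user_id": 1, "amount": 30.0},
    {"order_id": 11, "user_id": 2, "amount": 12.5},
    {"order_id": 12, "user_id": 1, "amount": 7.5},
]


def denormalize(orders, users):
    """Flatten each order by copying the user's attributes onto it."""
    return [{**o, "country": users[o["user_id"]]["country"]} for o in orders]


def preaggregate(flat_rows):
    """Roll the flattened rows up to one total per country."""
    totals = defaultdict(float)
    for row in flat_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


# Both steps run before the data is queryable, which is where the
# freshness gap comes from.
flat = denormalize(ORDERS, USERS)
revenue_by_country = preaggregate(flat)
```

Queries against `revenue_by_country` are cheap, but every new order has to pass through both steps before it shows up in a result.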
Real-time data pipelines have the potential to:
Delay data freshness: Denormalization and preaggregation introduce a time gap between data creation and its availability for querying, which can compromise the real-time aspect of analytics.
Limit flexibility: Once constructed, pipelines are hard to modify when business needs evolve. Changes to the data model or the pipeline often require extensive reengineering.
Complicate the system: Adding more components to the data pipeline increases the system's complexity and, with it, the chance of things going wrong. Each additional component is a potential point of failure that can disrupt the entire pipeline.
Incur high costs: The complexity of real-time pipelines demands specialized coding skills. This requirement raises operational costs and creates more opportunities for errors, which in turn delay the time to actionable insight.
The StarRocks project was born out of these limitations. With StarRocks, you no longer need to build intricate data pipelines to get query performance.
No more denormalization: StarRocks' incredible JOIN performance enables users to perform JOINs on the fly at query time, instead of denormalizing somewhere in their data pipeline. This streamlines analytics, reduces complexity, and enhances efficiency.
Manage preaggregation internally: StarRocks further simplifies the data pipeline by managing preaggregation within the system itself. This eliminates another tedious step in the data preparation process, making analytics faster and more accurate.
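The idea of keeping preaggregation inside the system, rather than in an upstream pipeline, can be sketched as a table that maintains its own rollup on every write. This is a toy illustration only: `RollupTable` is a hypothetical class and does not reflect how StarRocks implements preaggregation internally.

```python
from collections import defaultdict


class RollupTable:
    """Toy table that keeps raw rows and a per-key running sum in sync."""

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.rows = []
        self.rollup = defaultdict(float)

    def insert(self, row):
        # The aggregate is updated in the same step as the write, so
        # readers never see a rollup that lags behind the raw rows.
        self.rows.append(row)
        self.rollup[row[self.key]] += row[self.value]

    def total(self, key_value):
        # Served from the maintained rollup, not by rescanning the rows.
        return self.rollup.get(key_value, 0.0)


table = RollupTable(key="country", value="amount")
table.insert({"country": "US", "amount": 30.0})
table.insert({"country": "DE", "amount": 12.5})
table.insert({"country": "US", "amount": 7.5})
```

Calling `table.total("US")` reads the maintained sum directly. The point of the sketch is that no separate job sits between the write and the aggregate.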
Airbnb's advanced metrics management platform, Minerva, provides over 30,000 different metrics across 7,000 dimensions and stores more than 6 petabytes of data. This system enables users to establish metrics once and then apply them everywhere. With applications ranging from A/B testing to in-depth data exploration, Minerva caters to a hundred different teams for their data analysis needs.
Previously, Minerva used Apache Druid and Presto as its query layer. To work around the limited multi-table query performance of these systems, Airbnb engineers had to denormalize data in a separate pipeline before ingesting it into Minerva for serving. This denormalization pipeline was resource-intensive and expensive to maintain. Schema changes were extremely time-consuming: adding a new metric took hours or even days, depending on how much data needed to be backfilled.
To reduce the cost of the system and increase its flexibility and efficiency, Airbnb migrated Minerva from Presto and Druid to StarRocks. Because of StarRocks' exceptional JOIN performance, Airbnb engineers can keep the tables in a snowflake schema and perform JOINs on the fly at query time.
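As a rough picture of what joining a snowflake schema at query time looks like, the sketch below uses Python's built-in `sqlite3` module as a stand-in engine. The `bookings`, `listings`, and `regions` tables are hypothetical and are not Minerva's actual schema; SQLite is used only because it runs anywhere, and it says nothing about StarRocks' performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical snowflake schema: fact -> dimension -> sub-dimension.
    CREATE TABLE regions  (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE listings (listing_id INTEGER PRIMARY KEY, region_id INTEGER);
    CREATE TABLE bookings (booking_id INTEGER PRIMARY KEY,
                           listing_id INTEGER, nights INTEGER);

    INSERT INTO regions  VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO listings VALUES (100, 1), (101, 2), (102, 1);
    INSERT INTO bookings VALUES (1, 100, 3), (2, 101, 2), (3, 102, 4);
""")

# The metric is computed by joining at query time; no flattened table exists.
nights_by_region = conn.execute("""
    SELECT r.region_name, SUM(b.nights)
    FROM bookings b
    JOIN listings l ON l.listing_id = b.listing_id
    JOIN regions  r ON r.region_id  = l.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
```

The tables stay normalized, and the two-hop join from fact to sub-dimension happens only when the metric is requested.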
Using StarRocks with a snowflake schema frees Minerva engineers from the time-consuming and complicated task of denormalization. Engineers no longer need to backfill data or reconstruct tables when a metric is updated, which results in substantial resource savings.
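One way to see why a normalized layout avoids backfills is to count the rows an update has to touch. The sketch below again uses `sqlite3` as a stand-in, with hypothetical tables and a hypothetical `tier` attribute; it models a change to a dimension attribute, which is an assumption about the kind of update involved, not a description of Minerva's workload.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the attribute lives once, in the dimension table.
    CREATE TABLE listings (listing_id INTEGER PRIMARY KEY, tier TEXT);
    CREATE TABLE bookings (booking_id INTEGER PRIMARY KEY, listing_id INTEGER);
    -- Denormalized: the attribute is copied onto every fact row.
    CREATE TABLE bookings_flat (booking_id INTEGER PRIMARY KEY,
                                listing_id INTEGER, tier TEXT);

    INSERT INTO listings VALUES (100, 'standard');
""")
for booking_id in range(1, 1001):
    conn.execute("INSERT INTO bookings VALUES (?, 100)", (booking_id,))
    conn.execute(
        "INSERT INTO bookings_flat VALUES (?, 100, 'standard')", (booking_id,)
    )

# Changing the attribute touches one row in the normalized layout...
normalized_rows_touched = conn.execute(
    "UPDATE listings SET tier = 'premium' WHERE listing_id = 100"
).rowcount
# ...but every copied row in the flattened layout.
flat_rows_touched = conn.execute(
    "UPDATE bookings_flat SET tier = 'premium' WHERE listing_id = 100"
).rowcount
```

With 1,000 bookings for a single listing, the flattened table needs 1,000 rewrites where the normalized one needs a single update. At production scale, that gap is the backfill.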
Read the complete story of Airbnb Minerva's success with StarRocks here.
If you're looking to supercharge your real-time analytics while keeping processes streamlined and costs under control, consider giving CelerData Cloud a try. Unleash the potential of your analytics and redefine your real-time data pipeline strategy with a 1-month free trial here.