Airbnb is a world-famous community-based online marketplace for lodging. It provides a platform for hosts to share accommodations with travelers around the world. There are over 6M active listings available on Airbnb platform today, 4M+ hosts provide listings on Airbnb, and over 1B+ Airbnb guest arrivals all-time (as of Sept. 28 2021).
How to efficiently use data to power business at such a large scale has always been a challenge facing the Airbnb Data Infrastructure team. As the requirements for query latency and data freshness demand grows, the legacy data architecture is inadequate to meet new data analytics requirements. StarRocks, a new generation distributed OLAP solution that delivers real-time analytics, came to the rescue.
On March 15, Jingwei Lu, software engineer at the Airbnb Data Infrastructure team, gave a live broadcast on how Airbnb uses StarRocks to power real-time analytics in three typical business scenarios: Tableau dashboard, metrics store (the Minerva project), and trust analytics.
See the full video of the meetup at https://www.youtube.com/watch?v=AzDxEZuMBwM.
Similar to other tech companies, Airbnb's data analytics architecture consists of four layers: data layer, data warehouse layer, OLAP layer, and user layer. The OLAP layer is nearest to the user layer and therefore, must be flexible and powerful enough to meet demanding data analytics needs.
Previously, Airbnb used Apache Druid and Presto as the OLAP solution. However, the two are not satisfactory in query latency and data freshness. Query latency refers to how soon a user can obtain the query result, and data freshness refers to how soon the data can land at the OLAP layer.
Query latency and data freshness of an OLAP system directly affects business logic at the user layer. StarRocks' excellent capability in these two aspects unfold more possibilities for the Airbnb Data Infrastructure team. In this meetup, Jingwei Lu explains the challenges facing Airbnb in three typical analytics scenarios and the value StarRocks brings to Airbnb.
Tableau is a simple, powerful BI tool for data visualization. It is popular among business personnel at all levels. Airbnb has to deal with two challenges if they want to use Tableau. First, Tableau automatically generates complex SQL queries when business personnel perform drag-and-drop operations to generate BI reports.
The OLAP layer must be powerful enough to process and optimize such complex SQL. Apache Druid cannot meet this requirement and therefore, cannot directly connect to Tableau. Second, data records generated in BI scenarios can reach hundreds of millions, even hundreds of billions, which is beyond the processing capabilities of MySQL and Tableau's native storage.
Airbnb also tried Presto, only finding that the latency of complex queries in some cases exceeds 10 minutes. Such long latency is unbearable for users. It undermines interactive experience and renders data analytics meaningless.
All these challenges are addressed after StarRocks is introduced.
StarRocks makes it possible for Airbnb to deliver interactive analytics experience for Tableau users in the case of huge data volume.
Minerva is a unified metrics management platform within Airbnb. It is designed with the vision to allow users to "define metrics once, use them everywhere". Minerva stores over 12,000 metrics and 4,000 dimensions. It is positioned for various data consumption scenarios, such as A/B testing, data exploration, and data analysis. For more information about Minerva, see this blog post.
Previously, to accelerate queries, engineers have to denormalize data in Minerva and then use Apache Druid, Presto, Apache Spark, or Apache Hive to process queries.
As data volume surges, such an architecture will cause huge resource consumption due to the following reasons:
Airbnb is in urgent need of an OLAP system that can break the dilemma and allow for more flexible modeling.
After thorough and in-depth testing, Airbnb found that StarRocks works well with various data models:
With StarRocks, what Airbnb needs to do is pick a data model that best suits their business needs. StarRocks will take care of the rest.
This frees Minerva engineers from tedious and complex denormalization tasks. When a metric is updated, there is no need to backfill data or rebuild tables, saving a lot of resources.
To ensure the legitimate rights and interests of users and the platform, data scientists at Airbnb need to identify violations on the platform in a timely manner to prevent possible losses. This puts forward new requirements for the OLAP system:
Druid and Presto cannot meet these two requirements.
Airbnb built a new Trust Analytics architecture on top of StarRocks. In this architecture:
StarRocks shortens the time required for trust analytics from days to seconds, enabling real-time trust analytics.
StarRocks is well suited for Airbnb's query latency and data freshness requirements, unfolding more business possibilities for Airbnb.
Low query latency is a shared pursuit of all OLAP engines. Airbnb is looking for an OLAP engine that is highly performant in both wide-table and multi-table complex queries. With the guarantee of low latency, business personnel have the freedom to choose between data models and cultivate more analytics scenarios.
Specifically, Airbnb has three types of queries:
Complex queries are usually most flexible and unpredictable, which places high requirements for the functions and performance of an OLAP engine. StarRocks fits these requirements well and closes gaps in data analytics.
At function level:
At performance level, query plan optimization is equally important as query execution speed. An excellent query plan can make perfect arrangements for complex SQL statements, including those generated by the BI system. Ultra-fast query execution is the cornerstone behind flexible data modeling.
With such performance guarantee, data scientists and analysts can focus more on business analytics. StarRocks' cost-based optimizer (CBO) and vectorized execution engine are two important reasons that drive Airbnb to choose StarRocks.
Following is a query case of Airbnb. In this case, a StarRocks cluster built on 20 AWS EC2 i3.8xlarge instances is used (32 vCPUs, 240 GiB of memory, 3 TB SSD); the query involves three tables (one with 500 million rows, one 6 billion rows, and one 100 million rows), 4 joins, 3 distinct counts, and JSON and REGEX functions. StarRocks uses its CBO to reorder joins and push down operators. It takes only 3.6s for StarRocks to return the result, whereas in Presto, the query may take as long as 3-4 minutes.
Data freshness is critical to data analytics. Early data acquisition and analysis mean early opportunity capturing, instant problem solving, and timely actions. Data freshness hinges on how quickly data is imported and how quickly data updates take effect.
StarRocks provides a reliable and scalable import solution that offers the following features:
In terms of data update, the data update cycle of traditional data models is excessively long. Data lake products such as Apache Iceberg, Hudi, and Delta Lake also support data updates but the update cycle may take several minutes. StarRocks offers two data models that deliver dramatic performance uplift:
StarRocks' features and performance, especially real-time updates and efficient processing of complex queries help Airbnb discover new real-time analytics scenarios, eliminate the limitations of the flat-table schema, and make data analytics accessible to more front-line business personnel.
Further, StarRocks high-concurrency analytics capabilities are much better suited for our user-facing analytics sceneries (vs. our limited flexibility solution that relies on both Elasticsearch and a KV store). Finally, StarRock's powerful data lake analytics will accelerate our data warehouse's Hive-based interactive query model and gradually replace many if not all of our Presto workloads.
To the extent it has ownership over them, Airbnb claims all rights in any intellectual property embodied in this post; Airbnb authorizes no further re-use or re-publication of this content without its express, written consent. All non-Airbnb product names, logos, and brands are the property of their respective owners, are for identification purposes only, and do not imply endorsement by Airbnb of those products or brands.