Liansheng Zhida is part of Lenovo Group, a world-leading PC manufacturer. It is committed to building an intelligent logistics platform and providing integrated logistics solutions for industrial enterprises in China and abroad.
Liansheng Zhida's big data team gradually introduced a variety of OLAP analysis engines, among which StarRocks stands out for its strong performance. By introducing StarRocks, we built a new unified data service platform, which significantly reduced the complexity of the data pipeline and dramatically improved the efficiency of BI analysis.
Before 2018, Liansheng Zhida's data volume was modest. At this stage we used a traditional relational database (SQL Server) rather than a self-built data warehouse, and SQL scripts were enough.
As business complexity and data volume grew rapidly, this model soon hit a bottleneck. The main symptom was slow query response: for example, tasks that used to run in 10 or 20 minutes now took an hour. Data storage capacity also became a bottleneck.
In 2019, as the data warehouse was built and improved on the Apache Hadoop®/Apache Hive™ system, we moved all ETL tasks to the Apache Hadoop cluster. At this stage, we used dozens of Presto nodes for OLAP analysis, since Presto shares metadata with Apache Hive and supports flexible queries, and we used Tableau for data analytics and data mining.
In 2021, Liansheng Zhida's big data team carried out the overall design and construction of an offline data warehouse to meet the following requirements:
At this stage, we introduced StarRocks for its following advantages:
The entire system consists of data collection, storage, analysis, and application.
Apache Hive is responsible for all offline data ETL and data modeling jobs.
The data analysis layer provides query services. After ETL, data is written to an RDBMS or MPP database, which then serves multiple downstream scenarios such as self-service reporting (Tableau), multi-dimensional fixed reporting, and ad-hoc queries.
The data application layer provides reporting services for management and operations staff. Queries generated in this layer are often complicated and unpredictable, and they need real-time responses, especially in ad-hoc scenarios.
Advantages:
Disadvantages:
Advantages:
Disadvantages:
The core data of SEC (shipping to end customers) comes from two major sources: the consumer business and the SMB business. The data center needs to summarize the relevant business statistical indicators and provide analytics services.
Before introducing StarRocks, we used a large number of Apache Hive jobs to complete ETL. After ETL, part of the data was retained in Apache Hive, and part was written to MySQL/SQL Server. We then used Presto to connect to Apache Hive, MySQL, and SQL Server for reporting and data visualization.
The original architecture mainly has the following two problems:
We hoped to unify the original system under a single OLAP engine. The requirements for such an engine are relatively high.
After research, StarRocks fits the overall requirements. The new architecture has the following advantages:
StarRocks provides three data models: Duplicate Key Model, Aggregate Key Model, and Primary Key Model.
The SEC business currently relies mainly on the Duplicate Key Model. If other scenarios arise in the future, we will consider the other models.
StarRocks' data partitioning and bucketing functions can greatly improve the performance of detail queries on historical inventory and turnover data.
For example, a common query is inventory turnover over a certain period of time. Partitioning by delivery time lets the engine skip partitions outside the queried date range, so the query reads only the data it needs. Together with bucketing, prefix indexes, and related capabilities, this speeds up queries significantly.
These features can better accommodate business growth and the high concurrency scenarios that may arise in the future.
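As a rough illustration of the partitioning idea, the Python sketch below generates monthly range partitions on a delivery date column, renders a Duplicate Key table definition, and shows which partitions a date-range query would have to touch. The table name `sec_inventory_detail`, the columns `delivery_date`, `sn`, and `quantity`, the monthly granularity, and the bucket count of 8 are all assumptions made for the example, not taken from the production schema. The pruning function only mimics what the engine does internally.

```python
from datetime import date


def month_partitions(start_year, start_month, count):
    """Return (name, lower, upper) tuples for consecutive monthly partitions.

    Each range is half-open: lower <= delivery_date < upper.
    """
    parts = []
    year, month = start_year, start_month
    for _ in range(count):
        lower = date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        upper = date(year, month, 1)
        parts.append((f"p{lower:%Y%m}", lower, upper))
    return parts


def build_ddl(table, parts, bucket_column, buckets):
    """Render a Duplicate Key table partitioned by delivery_date (hypothetical schema)."""
    clauses = ",\n".join(
        f'    PARTITION {name} VALUES [("{lo}"), ("{hi}"))'
        for name, lo, hi in parts
    )
    return (
        f"CREATE TABLE {table} (\n"
        "    delivery_date DATE NOT NULL,\n"
        f"    {bucket_column} VARCHAR(64) NOT NULL,\n"
        "    quantity BIGINT\n"
        ")\n"
        f"DUPLICATE KEY(delivery_date, {bucket_column})\n"
        f"PARTITION BY RANGE(delivery_date) (\n{clauses}\n)\n"
        f"DISTRIBUTED BY HASH({bucket_column}) BUCKETS {buckets};"
    )


def pruned_partitions(parts, query_start, query_end):
    """Names of partitions overlapping the inclusive range [query_start, query_end]."""
    return [name for name, lo, hi in parts if lo <= query_end and query_start < hi]


parts = month_partitions(2021, 1, 12)
print(build_ddl("sec_inventory_detail", parts, "sn", 8))
# A turnover query for 10 March to 20 April only needs two of the twelve partitions.
print(pruned_partitions(parts, date(2021, 3, 10), date(2021, 4, 20)))
```

Bucketing on a high-cardinality column such as the serial number is one plausible choice for spreading rows evenly; the right column and bucket count depend on the actual data distribution.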
StarRocks' materialized views can be built in real time and on demand, and they are selected automatically at query time. We build materialized views based on SN inventory, product type, depot, and distributor, and these views greatly accelerate queries.
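To make the effect of a materialized view concrete, here is a small Python sketch that pre-aggregates detail rows by product type, depot, and distributor, then answers the same question from both the detail rows and the rollup. It is a simulation of the idea, not of StarRocks internals. The sample rows, the table and view names, and the `quantity` column are hypothetical, and the `MV_DDL` string is only an example of the kind of statement one might issue.

```python
from collections import defaultdict

# Hypothetical detail rows: (sn, product_type, depot, distributor, quantity).
DETAIL_ROWS = [
    ("SN001", "laptop", "depot_a", "dist_1", 1),
    ("SN002", "laptop", "depot_a", "dist_1", 1),
    ("SN003", "laptop", "depot_b", "dist_2", 1),
    ("SN004", "tablet", "depot_a", "dist_2", 1),
    ("SN005", "tablet", "depot_a", "dist_2", 1),
]

# Example statement for a comparable rollup in StarRocks (names are assumptions).
MV_DDL = """
CREATE MATERIALIZED VIEW sec_inventory_rollup AS
SELECT product_type, depot, distributor, SUM(quantity)
FROM sec_inventory_detail
GROUP BY product_type, depot, distributor;
"""


def build_rollup(rows):
    """Pre-aggregate quantity per (product_type, depot, distributor)."""
    rollup = defaultdict(int)
    for _sn, product_type, depot, distributor, quantity in rows:
        rollup[(product_type, depot, distributor)] += quantity
    return dict(rollup)


def depot_total_from_detail(rows, depot):
    """Answer the query by scanning every detail row."""
    return sum(row[4] for row in rows if row[2] == depot)


def depot_total_from_rollup(rollup, depot):
    """Answer the same query from the much smaller pre-aggregated result."""
    return sum(qty for (_ptype, d, _dist), qty in rollup.items() if d == depot)


rollup = build_rollup(DETAIL_ROWS)
print(len(DETAIL_ROWS), "detail rows ->", len(rollup), "rollup rows")
print(depot_total_from_detail(DETAIL_ROWS, "depot_a"),
      depot_total_from_rollup(rollup, "depot_a"))
```

Both lookups return the same total, but the rollup path reads one row per dimension combination rather than one row per serial number, which is where the speedup comes from as the detail data grows.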
Two methods are used to ingest data into StarRocks:
StarRocks supports both flat tables and star schemas. Flat tables and materialized views ensure performance and concurrency, while a star schema lets analytical (AP) workloads be modeled as in transactional (TP) systems and joined directly at query time, which improves data consistency and development efficiency.
Many tables are stored in MySQL and queried as external tables, thus eliminating the need for the data ingestion process.
StarRocks' distributed Join and Materialized View give us the ability to build a unified analysis layer that serves different BI reports, thus improving the consistency of indicators and reducing repetitive development.
We built BI reporting on SQL Server and MySQL in the early stage. As the business grew, some of the reports became more and more complex. Tableau reports took a long time to load when users filtered data on multi-dimensional conditions, which was unacceptable.
After introducing StarRocks, we ingest SQL Server data into StarRocks via DataX, using the StarRocks-Writer plug-in, whose import efficiency is very high. Data in MySQL can be ingested through "insert into select" or queried directly via an external table, which is very convenient. As a result, Tableau charts now appear in seconds, and the user experience has improved greatly.
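For readers unfamiliar with DataX, a job is described by a JSON file that pairs a reader with a writer. The Python sketch below assembles such a job for a SQL Server to StarRocks transfer. The field names follow the DataX `sqlserverreader` and `starrockswriter` documentation as we understand it and should be checked against the plug-in versions in use. Every host, port, database, table, column, and credential shown is a placeholder, not our actual configuration.

```python
import json


def build_datax_job(columns, source_table, target_db, target_table):
    """Assemble a DataX job dict: SQL Server reader -> StarRocks writer.

    All connection details below are hypothetical placeholders.
    """
    reader = {
        "name": "sqlserverreader",
        "parameter": {
            "username": "reader_user",
            "password": "CHANGE_ME",
            "column": list(columns),
            "connection": [{
                "table": [source_table],
                "jdbcUrl": ["jdbc:sqlserver://sqlserver-host:1433;DatabaseName=source_db"],
            }],
        },
    }
    writer = {
        "name": "starrockswriter",
        "parameter": {
            "username": "writer_user",
            "password": "CHANGE_ME",
            "database": target_db,
            "table": target_table,
            "column": list(columns),
            "preSql": [],
            "postSql": [],
            # JDBC goes to the FE query port, bulk load to the FE HTTP port.
            "jdbcUrl": "jdbc:mysql://fe-host:9030/",
            "loadUrl": ["fe-host:8030"],
            "loadProps": {},
        },
    }
    return {
        "job": {
            "setting": {"speed": {"channel": 1}},
            "content": [{"reader": reader, "writer": writer}],
        }
    }


job = build_datax_job(
    columns=["delivery_date", "sn", "quantity"],
    source_table="dbo.inventory_detail",
    target_db="sec",
    target_table="sec_inventory_detail",
)
print(json.dumps(job, indent=2))
```

The printed JSON can be saved to a file and passed to the DataX launcher. Keeping the reader and writer column lists identical, as the helper does, avoids the most common mapping mistakes.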
The data center is a core online service with very high requirements for availability and flexible capacity scaling. StarRocks stores multiple replicas of data and has a simple architecture with only two roles, FE and BE, so the cluster as a whole stays highly available when a node of either role fails. In addition, StarRocks supports online elastic scaling at large data volumes with no downtime during expansion, so online business is not affected. This capability is also something we value.
Liansheng Zhida began to study StarRocks in April 2021. StarRocks can replace a dozen nodes in a Presto cluster with 1/4 of the resources and is currently online and running stably.
With the introduction of StarRocks, we have implemented a unified data service that greatly simplifies offline data processing while meeting query latency requirements. We will use it to improve data services and query capabilities in more business scenarios.
We want to thank StarRocks for its strong support, and we hope StarRocks will grow into the leader of a new generation of MPP databases.
Apache®, Apache Hive™, Apache ZooKeeper™, Apache Flume™ and their logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.