Lenovo - Simplifying Data Processing and BI Analysis | CelerData

Written by CelerData | Aug 24, 2022 7:43:09 PM

Liansheng Zhida is part of Lenovo Group, a world-leading PC manufacturer. It is committed to building an intelligent logistics platform and providing integrated logistics solutions for industrial enterprises in China and abroad.

Liansheng Zhida 's big data team gradually introduced a variety of OLAP analysis engines, among which StarRocks stands out for its strong performance. By introducing StarRocks, we construct a new unified data service platform, which significantly reduces the complexity of data pipeline and dramatically improves the efficiency of BI analysis.

Evolution History of OLAP Engines

Phase I

Before 2018, the amount of data in Liansheng Zhida was not huge. We used the traditional relational database (SQL Server) instead of a self-built data warehouse in this stage. SQL scripts were enough.

With rapid developments in business complexity and data volume, this model soon encountered a bottleneck. The primary manifestation is that the query response time became sluggish. For example, tasks that need to take 10 minutes or 20 minutes to run before take an hour now. There is also a bottleneck in data storage capacity.

Phase II

In 2019, as the data warehouse was built and improved on the Apache Hadoop®/Apache Hive™ system, we transferred all ETL tasks to the Apache Hadoop cluster. At this stage, we use dozens of Presto to complete OLAP analysis since it shares metadata information with Apache Hive and supports flexible queries, and use Tableau to complete data analytics and data mining.

Phase III

In 2021, Liansheng Zhida's big data team carried out the overall design and construction of offline data warehouse to meet the following requirements:

Support low-latency BI reporting;
Support complex ad-hoc queries;
Support real-time queries of detailed data.

At this stage, we introduced StarRocks for its following advantages:

It supports ad-hoc queries for multi-table join;
It supports complex nested sub-query;
It supports queries on detailed data and multi-dimensional materialized view, outperforming ClickHouse.

Data Analytics System Architecture

Status-Quo

The entire system consists of data collection, storage, analysis, and application.

Data Collection

Data was read from RDBMS through Apache Scoop™ and imported into Apache Hive.
Apache Flume™ was used to synchronize log files to Apache Hive.
Data crawled from the Internet will be stored in RDBMS, and then read by Sqoop and imported into Apache Hive.

Data Storage

Apache Hive is responsible for all offline data ETL and data modeling jobs.

Data Analysis

The data analysis layer provides query services. Data will be written to RDBMS or MPP database after ETL, then provide multiple downstream services for different scenarios such as self-service reporting(Tableau), multi-dimensional fixed reporting, and ad-hoc queries.

Data Application

The data application layer provides reporting services for management and operation personnel. Queries generated in this layer are always complicated and unpredictable, and always need to be responded in real-time, especially in ad-hoc scenarios.

A Comparison of OLAP Analysis Tools

ClickHouse

Advantages：

Strong single-table query performance, suitable for multi-dimensional analysis.
Contains a rich MergeTree Family and supports pre-aggregation.
Very suitable for large-scale log analysis.

Disadvantages:

Does not support REAL deletion and update.
Join method is not friendly enough.
The concurrency capability is relatively low.
The merge of MergeTree is not complete.

StarRocks

Advantages:

Supports both single table and multi-table queries.
High concurrency.
Support real-time data ETL in micro-batch.
Support stream and batch data ingestion.
Compatible with MySQL protocol and standard SQL.

Disadvantages:

Insufficient large-scale ETL capabilities.
Resource isolation is not perfect yet.

Application Practice of StarRocks in SEC Data Center

The core data of SEC(shipping to end customers) comes from two major factors: one is consumer business; the other is SMB business. The data center needs to summarize relevant business statistical indicators and provide analytics services.

Original Solution

Before introducing StarRocks, we used a large number of Apache Hive jobs to complete ETL. After ETL, part of the data was retained in Apache Hive, and part of the data was written to MySQL/SQL Server. Then, we use Presto to connect to Apache Hive, MySQL, SQL Server to achieve reporting and data visualization.

Technical Challenges

The original architecture mainly has the following two problems:

The data platform is not well integrated, making it difficult to maintain and respond to new demand quickly.
Presto is slow on the complex SQL generated by Tableau, which cannot meet the needs of real-time data viewing.

We hope to unify the original system with one OLAP engine. The requirements for the OLAP engine are relatively high.

It needs to support large throughput data ingestion.
It needs to be able to respond to flexible queries of multi-dimensional in less than 100ms.
It needs to provide better support for multi-table join.
It needs to be able to respond less than 100ms to queries of a single table with a data volume of more than 1 billion.

After research, StarRocks fits the overall requirements. The new architecture has the following advantages:

The overall structure is clear. The RDBMS focuses on data cleaning while reporting calculation is moved from Apache Hive to StarRocks, making StarRocks the endpoint for data service.
StarRocks can maintain a unified data platform, one data input with multiple APP interface output.
StarRocks' MPP architecture can better support distributed aggregation and join queries.
StarRocks has good compatibility with Tableau, meeting BI analysis needs.StarRocks Solution

Upgraded Architecture

Data Model Design

StarRocks provides three data models: Duplicate Key Model, Aggregate Key Model, and Primary key Model.

For the SEC business, Duplicate Key Models are currently the mainstream. If there are other scenarios in the future, we will consider other options.

Data Partition/Bucket

StarRocks' data partitioning and bucketing functions can greatly improve the performance of detailed queries in historical inventory and turnover data.

For example, a standard query scenario is the inventory turnover in a certain period of time. We can partition by delivery time to filter out unnecessary data, thus minimizing the date range covered by the query statement. By partition, bucket, prefix index and other capabilities, we can facilitate queries significantly.

These features can better accommodate business growth and the high concurrency scenarios that may arise in the future.

Materialized View

StarRocks’ materialized view can be built in real-time and on-demand. It can also be automatically selected at query. We build materialized views based on the SN inventory, product type, depot, and distributor. Based on these views, queries can be greatly accelerated.

Data Ingestion

Two methods are used to ingest data into StarRocks:

Ingest the Apache Hive tables of offline data into StarRocks by Broker Load.
Ingest data from SQL Server and MySQL to StarRocks through DataX.

StarRocks in Practice

Flexible Modeling Improves Development Efficiency

StarRocks supports both flat table and star schema. Flat tables and materialized views can ensure performance and concurrency, while star schema allows APs to model as in TP and perform join queries directly, thus improving the data consistency and development efficiency.

Many tables are stored in MySQL and queried as external tables, thus eliminating the need for the data ingestion process.

StarRocks' distributed Join and Materialized View give us the ability to build a unified analysis layer, which provides service for different BI reporting, thus improving the consistency of indicators and reducing repetitive development.

Excellent BI Experience

We built the BI reporting based on SQL Server and MySQL in the early stage. As the business grows, some of the reporting becomes more and more complex. Tableau reports would take a long time to load when users use multi-dimensional conditions to filter data, which is unacceptable.

After introducing StarRocks, we ingest SQL Server data into StarRocks via DataX. Here we use the StarRocks-Writer plug-in, whose import efficiency is very high.Data in MySQL can be ingested through “insert into select”, or directly query via external table, which is very convenient. In this way, Tableau chart appears in seconds, and the experience has been greatly improved.

Cheaper to Maintain

The data center is a core online service with very high availability and flexible capacity scaling requirements. StarRocks supports multiple replicas of data. It has a simple architecture composed of only two roles, FE and BE, which can ensure the high availability of the entire cluster when either node fails. In addition, StarRocks can perform online elastic expansion under the large data volume, and there is no Down Time during expansion, which will not affect online business. This capability is also what we value.

Conclusion

Liansheng Zhida began to study StarRocks in April 2021. StarRocks can replace a dozen nodes in a Presto cluster with 1/4 of the resources and is currently online and running stably.

With the introduction of StarRocks, we have implemented a unified data service that greatly simplifies offline data processing while ensuring query latency requirements. We will use this to improve data services and query capabilities in more business scenarios.

We want to thank StarRocks for its strong support, hoping that StarRocks will grow as the leader of a new generation of MPP databases.

Apache®, Apache Hive™,Apache ZooKeeper™,Apache Flume™ and their logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.

View full post