Our app is a free messaging and calling app that connects more than one billion users across the world. Users can chat and call family and friends, share favorite moments, make mobile payments, and much more. The app has become a lifestyle woven into the fabric of people's everyday lives.
Our monitoring systems fall into three main types: a framework monitoring system, a monitoring system with custom metrics, and a cloud-native observability platform.
Framework monitoring system: This system uses a customized built-in module to monitor calls between modules. The metrics include call chains of modules, main calls, background calls, average time of calls, and failures.
Monitoring system with custom metrics: This system is used to monitor business activities, such as payment, failed calls made to merchants, and success rates of app notifications. The monitoring task is completed on two platforms: a two-dimensional monitoring system and a multidimensional monitoring system.
In the two-dimensional monitoring system, users can monitor business activities in only two dimensions: user ID and Key columns. Users must first apply for an ID and then define key columns. This type of monitoring can be used only in limited scenarios to meet basic monitoring requirements. A more advanced OLAP platform for multidimensional monitoring is required.
The multidimensional monitoring system is better suited to complex monitoring, such as analyzing sales by city and by e-commerce operator, or breaking down failures by error code.
Cloud-native observability platform: This platform collects telemetry data, such as metrics, logs, and trace data from various sources and is compatible with OpenTelemetry.
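A multidimensional monitoring query of the kind described above can be sketched in SQL. The table and column names below (`sales_monitor`, `city`, `operator`, `error_code`, `amount`) are illustrative placeholders, not the actual protocol schema:

```sql
-- Hypothetical monitoring table; all names are illustrative only.
-- Top 10 city/operator/error-code combinations by failure count
-- over the last hour.
SELECT
    city,
    operator,
    error_code,
    COUNT(*)    AS failures,
    SUM(amount) AS failed_amount
FROM sales_monitor
WHERE event_time >= NOW() - INTERVAL 1 HOUR
  AND error_code <> 0
GROUP BY city, operator, error_code
ORDER BY failures DESC
LIMIT 10;
```

Queries like this, grouping on arbitrary dimension combinations, are exactly what the two-dimensional (ID + key) system cannot express.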
This article introduces how StarRocks helps improve the multidimensional monitoring system for our app.
The multidimensional monitoring system in our company is a flexible and self-service OLAP platform based on various dimensions.
This monitoring system involves the following three important concepts: protocols, views, and anomaly detection platform.
Business scale of our multidimensional monitoring system:
The multidimensional monitoring system has the following business characteristics:
The multidimensional monitoring system has the following requirements:
Our monitoring platform processes a large number of protocols from various data scenarios. The data scenarios can be classified into four categories based on the volume of data reported per minute and the dimension complexity (number of dimension combinations).
Overall data processing and analysis procedure:
The previous architecture uses the following methods to store and compute data:
Our multidimensional monitoring system excessively relies on Apache Druid and is confronted with the following pain points:
To resolve the preceding pain points, we thoroughly tested and verified the performance of StarRocks before introducing it as an additional OLAP platform.
Advantages of StarRocks:
We use real data to test the performance of StarRocks on our platform from different aspects.
We use StarRocks Routine Load jobs to import JSON data from Apache Kafka.
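A Routine Load job of this kind can be sketched as follows. The database, table, column, topic, and broker names are placeholders, and the property values are examples rather than our production settings:

```sql
-- Illustrative sketch of a Routine Load job consuming JSON from Kafka.
CREATE ROUTINE LOAD monitor_db.load_metrics ON metrics_table
COLUMNS (event_time, protocol_id, dim_values, metric_value)
PROPERTIES
(
    "format" = "json",                  -- consume JSON records
    "desired_concurrent_number" = "5",  -- example parallelism, not our tuned value
    "max_batch_interval" = "10"         -- flush at most every 10 seconds
)
FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "monitor_metrics"
);
```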
Hardware configuration: 48 physical cores with hyper-threading enabled (total of 96 logical cores), Intel® Xeon® Platinum 8255 Processor, 192GB RAM, 16TB NVMe SSD; 5 hosts
Data source: real online data, written at 50 million rows per minute with an average JSON record size of 1 KB.
Test procedure: Data is first imported into Apache Kafka and accumulated for a certain period of time before StarRocks begins to consume the data. This helps pressure test the maximum load speed of StarRocks.
Settings of key parameters:
Test results: StarRocks demonstrates excellent loading performance. In the preceding figure, about 50 million records are loaded from Apache Kafka to StarRocks per minute. However, StarRocks can consume as many as 210 million records per minute from Apache Kafka, which is much higher than the rate at which data is written to Apache Kafka and fully meets our requirements for data loading.
Hardware configuration: 48 physical cores with hyper-threading enabled (total of 96 logical cores), Intel® Xeon® Platinum 8255 Processor, 192GB RAM, 16TB NVMe SSD; 5 hosts
Cardinality: 11.5 million dimension combinations (excluding the time dimension)
Database versions: StarRocks 1.18.1, Apache Druid 0.20.0
Query response time in low concurrency scenarios
We use StarRocks and Apache Druid to run four time-series queries and four TopN queries in low-concurrency scenarios. Each query is performed five times, and the average is used as the final result. During the tests, Apache Druid rolls up data on all dimensions, and StarRocks uses materialized views.
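The materialized views used in these tests can be sketched as a synchronous rollup in StarRocks. The base table and columns here are illustrative, not our actual schema:

```sql
-- Illustrative rollup: pre-aggregate per minute and per dimension
-- so TopN and time-series queries hit the rollup instead of raw rows.
CREATE MATERIALIZED VIEW metrics_rollup AS
SELECT
    event_minute,
    city,
    error_code,
    SUM(metric_value) AS total_value
FROM metrics_table
GROUP BY event_minute, city, error_code;
```

Like Druid's rollup, this trades a small amount of storage for much cheaper aggregation at query time.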
StarRocks outperforms Apache Druid in 7 out of 8 tests on data with complex dimensions, and it performs well even on slow queries (TopN queries on high-cardinality fields): StarRocks completes such a query within 5 s, whereas Apache Druid takes about 20 s. In conclusion, StarRocks performs far better than Apache Druid in low-concurrency scenarios.
QPS in high concurrency scenarios
To simulate high concurrency, we use the same eight SQL statements to run 16 and 32 concurrent queries, and measure the QPS and average query response time of Apache Druid and StarRocks. As listed in the table above, StarRocks outperforms Apache Druid in 12 out of 16 tests and has a higher QPS than Apache Druid in 10 tests. However, neither engine can complete slow queries in high-concurrency scenarios.
In conclusion, for high-concurrency queries on protocols with complex dimensions, StarRocks outperforms Apache Druid thanks to flexible bucketing and materialized views. We also find that the query concurrency StarRocks can sustain correlates strongly with its CPU usage: performance degrades as CPU usage climbs, whereas Apache Druid remains more stable.
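The flexible bucketing mentioned above refers to the hash distribution declared at table creation. A minimal sketch, with illustrative names, an example partition, and an example bucket count:

```sql
-- Illustrative table DDL: partitioned by day and hash-bucketed on
-- protocol_id so queries on one protocol touch fewer tablets.
CREATE TABLE metrics_table
(
    event_time   DATETIME NOT NULL,
    protocol_id  INT      NOT NULL,
    city         VARCHAR(64),
    error_code   INT,
    metric_value BIGINT
)
DUPLICATE KEY (event_time, protocol_id)
PARTITION BY RANGE (event_time)
(
    PARTITION p20230101 VALUES LESS THAN ("2023-01-02")
)
DISTRIBUTED BY HASH (protocol_id) BUCKETS 32;
```

The bucketing column and bucket count are per-table choices, which is what lets each protocol be tuned to its own dimension complexity.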
StarRocks vs. Apache Druid

| Vendor | Strength | Weakness |
|---|---|---|
| StarRocks | Faster queries on protocols with complex dimensions, backed by flexible bucketing and materialized views | Performance degrades under high CPU usage |
| Apache Druid | More stable under high concurrency | Much slower on complex-dimension queries and high-cardinality TopN queries |
The following figure shows the new architecture of our multidimensional monitoring system. Both StarRocks and Apache Druid are used as the storage and compute engines for the system.
StarRocks outperforms Apache Druid in data queries with complex dimensions. Therefore, we have migrated the real-time compute and storage workloads of some complex protocols from Apache Druid to StarRocks: nine protocols in total, covering areas such as payment, video channels, search, and security.
The peak load speed has reached 70 million rows per minute, or 60 billion rows per day. The average response time has been reduced from 1,200 ms to about 500 ms, and that for slow queries from 20 s to 6 s.
Our multidimensional monitoring system connects to thousands of data sources across various data scenarios. The extraordinary performance of StarRocks helps us resolve many issues that could not be resolved in the Apache Druid architecture. We will continue to use these two OLAP engines as our major storage and compute engines, and we will actively explore the application of StarRocks in more scenarios.
We hope to see performance improvements in StarRocks in high-concurrency and high-CPU-load scenarios. StarRocks' upcoming release of decoupled storage and compute will further enhance its system stability. We believe StarRocks will go further on its journey towards a unified and ultra-fast OLAP database.
Apache®, Apache Flink®, Apache Kafka®, Apache Druid®, Apache ZooKeeper™ and their logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.