Join StarRocks Community on Slack
Connect on SlackOne of the most common solutions our customers compare us to in their evaluation process is ClickHouse, and that's no surprise. After all, ClickHouse is an established, well-known player in the real-time analytics space. What might be a surprise to some, however, is how often CelerData and StarRocks come out a winner at the end of that evaluation process. So why is that?
During a recent webinar, we discussed what sets CelerData and StarRocks apart from ClickHouse. In short, ClickHouse is limited by its reliance on denormalization whereas CelerData and StarRocks have no such weakness thanks to the ability to adeptly handle on-the-fly JOINs.
This discussion resulted in an insightful Q&A session that featured several common questions we get asked about our approach to real-time analytics. In case you don't have time to watch the full webinar, we'll review those questions here.
Question: How does the query optimizer in StarRocks, including JOIN optimization, differ from Spark's optimizer? Was there any inspiration drawn from other optimizers during the development of StarRocks?
Answer: Yes, StarRocks' optimizer has a cascade-like framework and was indeed influenced by academic research. For example, Efficiency in the Columbia database query optimizer [1].
StarRocks and Spark serve different purposes. While Spark is geared towards ETL workloads, StarRocks is optimized for low-latency queries. Hence, their optimizers differ as well.
[1] Xu, Y. (1998). Efficiency in the Columbia database query optimizer (Master's thesis, Portland State University).
Question: Why do I see ClickHouse outperform StarRocks on ClickBench when your data says otherwise?
Answer: The default ClickBench results are skewed because they incorporate all the tuned performances, and ClickHouse has adjusted their SQL queries for the benchmark, which isn't an equitable comparison.
By disabling the "tuned" option in ClickBench, users will see more accurate, competitive results. Follow this link to see ClickBench without the "tuned" results.
Figure 1: ClickBench Results
Question: If the internal storage and the compute node are decoupled, doesn't it increase the network overhead? What is the recommended design?
Answer: Decoupling does increase network overhead. However, StarRocks mitigates this with a data cache that locally stores frequently accessed data, both on local disks and in memory, to improve query performance.
This cache doesn't store persistent data, so when dropping a node, there is no data loss. Additionally, StarRocks offers a materialized view, particularly beneficial for data lake query performance, allowing data from external sources to be transformed and optimized by SR's own storage engine.
Question: Can you speak to the JOIN algorithms and strategies of each database including Snowflake, SingleStore, MySQL, and Postgres?
Answer: Detailed insights into Snowflake and SingleStore are limited due to their proprietary (closed source) nature, but Snowflake appears to operate similarly to StarRocks, judging by its query profiles.
MySQL and Postgres, being row-store databases, aren't optimized for OLAP workloads like StarRocks, ClickHouse, or Druid, which are columnar stores, thus faster for OLAP tasks. The ClickBench benchmark demonstrates that columnar stores significantly outperform row stores in single-table tests.
Question: Are there any drawbacks to shuffle join?
Answer: Shuffle join, while efficient in many cases, isn't universally optimal. For instance, when the right table is very small, a broadcast join might be more efficient. Therefore, the strategy depends on the specific use case and there is no "one solution fits all" in query optimization.
Question: Where can performance benchmarks be found?
Answer: Performance benchmarks are available on CelerData.com under resources, in the white papers and case studies section here.
Question: Is there ongoing development to enhance StarRocks' JOINs and optimizer?
Answer: Yes, the team continually refines and optimizes the system, often based on real-world user feedback. Recent developments include optimizations on the string JOIN key, data lake query metadata caching, statistics collection, and primary key tables. Please read the StarRocks release notes to learn more! You can read it on StarRocks' documentation page, or StarRocks' GitHub release page.
Have More Questions?
Hopefully, this article was able to answer the questions you may have had about CelerData, StarRocks, and how we compare to other solutions like ClickHouse. If you're still hungry for more answers, however, we encourage you to check our comparison pages for ClickHouse, Apache Druid, and Trino/Presto.
Additionally, if you'd like a deeper dive into what CelerData and StarRocks can do, don't hesitate to reach out for a demo or connect with our community on Slack.