If you work in the data engineering and infrastructure space, you know that choosing the right analytical database is important. Few things can impact the quality of your work (and your general happiness) like building on top of the wrong database. The best way to avoid future headaches is with a thorough evaluation process that ensures the solution you select for your business adequately meets your needs.
Unfortunately, evaluations can be quite time-consuming, and in most cases you have limited time to make a decision. How can you be sure you're even looking at the right solutions? This article takes a look at one tool that can accelerate your search for the right solution: ClickBench.
Traditionally, benchmark reports, industry press coverage, and even forum posts have been useful ways to quickly understand your options, but sifting through them can also be demanding on your time. Recognizing this challenge, more and more benchmarking and comparison tools are starting to show up. These software tools and websites make it easy to pull key performance metrics and product features together so you can quickly identify the solutions worth evaluating. One such tool is ClickBench, a comparison tool that offers a rather exhaustive list of the top analytical databases available today.
Launched in 2022, ClickBench is a free-to-use benchmarking tool built and maintained by the team behind ClickHouse. Designed to help users quickly compare both open-source and proprietary analytical databases, ClickBench has grown in popularity thanks to its well-maintained (and growing) list of databases available for comparison.
For testing purposes, the benchmark offered by ClickBench includes:
A dataset that contains a flat table with exactly 99,997,497 records.
43 queries that cover a range of data analytics use cases with a flat table schema.
ClickBench is a breeze to test with as well. According to its maintainers, benchmarking a system end-to-end with ClickBench should take less than 20 minutes.
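To make the idea of a fixed dataset plus a fixed query list concrete, here is a minimal sketch of a timing loop in Python. It is not ClickBench's actual harness: the in-memory SQLite database, the tiny `hits` table, its columns, the three sample queries, and the choice of three runs per query are all assumptions made for illustration.

```python
import sqlite3
import time


def time_queries(conn, queries, runs=3):
    """Run each query `runs` times; return one list of timings (seconds) per query."""
    results = []
    for sql in queries:
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            conn.execute(sql).fetchall()  # fetch all rows so the full query cost is measured
            timings.append(time.perf_counter() - start)
        results.append(timings)
    return results


# Tiny in-memory stand-in for a flat benchmark table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (user_id INTEGER, region TEXT, duration INTEGER)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [(i % 100, "r%d" % (i % 5), i % 60) for i in range(10_000)],
)

# Sample analytical queries over the flat table (illustrative, not the real query set).
queries = [
    "SELECT COUNT(*) FROM hits",
    "SELECT region, AVG(duration) FROM hits GROUP BY region",
    "SELECT user_id, COUNT(*) AS c FROM hits GROUP BY user_id ORDER BY c DESC LIMIT 10",
]

timings = time_queries(conn, queries)
for sql, query_runs in zip(queries, timings):
    print(f"best of {len(query_runs)}: {min(query_runs):.6f}s  {sql}")
```

Repeating each query and keeping the individual timings makes it possible to separate a slower first run from later, warmed-up runs.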
ClickBench also comes with an easy-to-use interactive dashboard, which not only gives users a way to compare the performance of different databases with a few simple clicks of the mouse, but also provides an opportunity for users to discover new solutions that they may not have been aware of. The ClickHouse team has put together a really great tool for performance comparisons, and its popularity is well-deserved, but there are some things users should know before they start using ClickBench in their evaluation process.
Thanks to its user-friendly dashboard, ClickBench has a pretty flat learning curve. Although it is simple to use, it does not fully replace the evaluation you would do when looking for a new solution. It's important to know what ClickBench should be used for and what it can't tell you.
ClickBench provides users with a fast and simple way to view and reproduce the ranking of databases by analytics performance (i.e., query speed). The results are based on a predefined dataset and a collection of queries, so while ClickBench can't be used to evaluate based on your specific scenarios, it can be very useful for:
Quickly building your shortlist of databases for further evaluation.
Discovering new solutions that offer superior performance you may not have known about.
Developing a general sense of who the current leaders are when it comes to database performance.
ClickBench's flexibility in filtering results by hardware configuration, software version, and even cluster size means it can still provide some useful metrics even if the benchmark is not built on your specific business scenarios.
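For readers wondering how per-query timings can be turned into a single ranking, the sketch below shows one common approach: compare each system's time on every query to the fastest system on that query, then take the geometric mean of those ratios. This is an illustration under stated assumptions rather than a description of ClickBench's exact scoring. The system names, the timings, and the small constant `offset` (used so that very short queries do not dominate) are all hypothetical.

```python
import math


def relative_scores(best_times, offset=0.01):
    """Map each system to a relative score; lower is better.

    best_times: {system name: [best time in seconds for each query]}.
    A score of 1.0 means the system was the fastest on every query.
    """
    systems = list(best_times)
    n_queries = len(best_times[systems[0]])
    # Baseline for each query: the fastest time any system achieved on it.
    baseline = [min(best_times[s][q] for s in systems) for q in range(n_queries)]
    scores = {}
    for s in systems:
        ratios = [
            (offset + best_times[s][q]) / (offset + baseline[q])
            for q in range(n_queries)
        ]
        # Geometric mean of the ratios, computed in log space.
        scores[s] = math.exp(sum(math.log(r) for r in ratios) / n_queries)
    return scores


# Hypothetical timings for two made-up systems on three queries.
example_times = {
    "db_a": [0.10, 0.50, 2.00],
    "db_b": [0.20, 0.40, 1.00],
}
scores = relative_scores(example_times)
ranking = sorted(scores, key=scores.get)  # fastest overall first
print(ranking, scores)
```

A geometric mean keeps one very slow query from swamping the summary the way an arithmetic mean of raw times would, which is why it is a popular choice for this kind of ranking.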
ClickBench's greatest strength comes from being able to help you get started evaluating databases quickly. It's able to deliver on this because of its reliance on a relatively small dataset and simple testing paradigm. Adding a solution to it or reproducing a result is fast. However, this agility does come with limitations:
The dataset ClickBench uses contains only a single flat table, which fits only some scenarios. Most of the time, a star or snowflake schema is a better fit than a single flat table (for example, schema and data changes are easier to handle). To take advantage of a star or snowflake schema, a database needs on-the-fly JOIN capability, which the ClickBench benchmark does not exercise. For multi-table query performance benchmarking, you should look at the TPC-H and TPC-DS benchmarks.
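The difference between the two layouts can be sketched with Python's built-in `sqlite3` module. The tables, columns, and rows below are hypothetical and far smaller than any real workload. The point is only that the star layout needs a JOIN at query time, while making a change such as renaming a region a one-row update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Flat (denormalized) layout: the region name is repeated on every row.
    CREATE TABLE orders_flat (order_id INTEGER, region_name TEXT, amount INTEGER);
    INSERT INTO orders_flat VALUES (1, 'EMEA', 100), (2, 'EMEA', 50), (3, 'APAC', 70);

    -- Star layout: a fact table that references a small dimension table.
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region_name TEXT);
    CREATE TABLE fact_orders (order_id INTEGER, region_id INTEGER, amount INTEGER);
    INSERT INTO dim_region VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO fact_orders VALUES (1, 1, 100), (2, 1, 50), (3, 2, 70);
""")

# Same question asked of both layouts: total amount per region.
flat_sql = """
    SELECT region_name, SUM(amount)
    FROM orders_flat
    GROUP BY region_name
    ORDER BY region_name
"""
star_sql = """
    SELECT d.region_name, SUM(f.amount)
    FROM fact_orders AS f
    JOIN dim_region AS d ON d.region_id = f.region_id
    GROUP BY d.region_name
    ORDER BY d.region_name
"""
flat_result = conn.execute(flat_sql).fetchall()
star_result = conn.execute(star_sql).fetchall()
print(flat_result == star_result)

# Renaming a region touches a single dimension row in the star layout;
# the flat table would need every matching row rewritten.
conn.execute("UPDATE dim_region SET region_name = 'Europe' WHERE region_id = 1")
renamed_result = conn.execute(star_sql).fetchall()
print(renamed_result)
```

A benchmark that only ever runs queries shaped like `flat_sql` says nothing about how well a database executes queries shaped like `star_sql`.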
The scale of the benchmark is relatively small, with only 99,997,497 records, and most of the tests are done in a non-distributed environment.
A lack of concurrency and multi-tenant testing: When a database is deployed in production at scale, adapting to a massive number of users is a must. With multiple tenants, we need to evaluate features such as high concurrency and resource isolation. This is out of scope for what ClickBench can test.
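If concurrency matters for your workload, you will need your own test for it. The sketch below shows the general shape of such a test in Python: fire the same workload with different numbers of concurrent workers and compare latency percentiles. The `run_query` function is a hypothetical stand-in that just sleeps, so replace it with a call to your own database client. The worker counts and the simple percentile helper are also illustrative choices.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(sql):
    """Hypothetical stand-in for a real database call; swap in your own client."""
    time.sleep(0.01)  # simulate query execution time
    return sql


def timed(sql):
    """Return the wall-clock latency of one query in seconds."""
    start = time.perf_counter()
    run_query(sql)
    return time.perf_counter() - start


def concurrent_latencies(queries, workers):
    """Run all queries using `workers` threads; return one latency per query."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(timed, queries))


def percentile(values, pct):
    """Simple rank-based percentile approximation, good enough for a quick look."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]


workload = ["SELECT 1"] * 40
for workers in (1, 8):
    latencies = concurrent_latencies(workload, workers)
    print(
        f"workers={workers} "
        f"p50={percentile(latencies, 50):.4f}s "
        f"p99={percentile(latencies, 99):.4f}s"
    )
```

With a real database behind `run_query`, the interesting signal is how much the p99 latency grows as the worker count increases, and whether one tenant's heavy queries slow down another's.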
Benchmarks can't tell the whole story: This is true even for query performance. Apart from a fast execution engine and generating optimal execution plans, there are other means of query acceleration (such as secondary indexes) or even (partially) bypassing computation altogether (such as materialized views and query result cache). Query optimization methods can vary between databases, which makes it very difficult to test with a one-size-fits-all benchmark tool.
To put this all more simply, results from ClickBench, while insightful, are insufficient on their own for making critical decisions.
Even if ClickBench's results are far from definitive, the StarRocks community loves a challenge and was eager to see how the project compared to other databases on ClickBench as soon as the tool was announced on Hacker News.
The first thing we needed to do was understand how ClickBench handles results. For all databases included in ClickBench, there are two types of results: fine-tuned and vanilla.
Since vanilla results are the recommended way of benchmarking a database with ClickBench, we opted to focus on our performance there.
In ClickBench, for a result to be classified as vanilla, it needs to:
Run out-of-the-box with no manual configuration.
Use the default SQL scripts with no modifications for performance boosts.
Once we understood how we were being evaluated, the StarRocks community got to work testing StarRocks on ClickBench. We're proud to say that, as of the writing of this post, StarRocks has reached the top spot in ClickBench's vanilla query performance benchmark.
This is a nice feather in the cap of the StarRocks project, and a big achievement for the StarRocks community that came together to get it here. We're also grateful to ClickBench for providing the opportunity (and challenge) to showcase StarRocks' performance. We're excited to see how ClickBench as a tool develops. In the meantime, the StarRocks community will keep optimizing its performance, especially in multi-table join queries, to ensure users continue to enjoy blazing-fast query performance in all scenarios.
ClickBench is a fantastic tool for data engineers looking to get a leg up on their evaluation process for a new analytical database. It is not, however, a source for definitive rankings of the best analytical databases. It can only really tell you about a database's performance, and even there you won't be getting the whole story.
If you want to get serious about evaluating solutions, there are four other important criteria you should be looking at. These are:
Timeliness
Scalability
Operational Efficiency
Cost Effectiveness
And if you'd really like to make sure your next evaluation is a success, you should check out this article on the 5 most important criteria for choosing an analytics platform.