Query Optimization
What is Query Optimization?
Query optimization is a feature that allows the optimizer to determine the most efficient way to execute a given query by considering various query plans. The query optimizer, a critical component of relational database management systems, usually operates behind the scenes, and users cannot access it directly. Once queries are submitted to the database server and parsed, they are passed to the query optimizer for optimization. However, some database engines allow users to guide the query optimizer with hints.
A query optimizer is responsible for selecting the most efficient execution plan for a given SQL query. SQL is a structured query language that tells the database "what" it wants but doesn't tell the database "how" to obtain the results. This "how" is determined by the database's "brain", known as the query optimizer. A query optimizer can be compared to a highly skilled travel agent who plans the most efficient route for a trip. When you provide the travel agent with your desired destinations, they don't just look at a map and choose the most straightforward path. Instead, they consider multiple factors, such as available transportation options, the time it takes to travel between destinations, costs, and potential disruptions like traffic or weather.
In the same way, a query optimizer takes an SQL query and determines the most efficient way to execute it by considering various query plans and their associated costs. It doesn't just choose the first plan it finds; instead, it evaluates multiple options to find the best execution plan, ensuring that the database performs at its optimal level. Like a travel agent, the query optimizer works behind the scenes to deliver the best possible result for the user, without the user having to worry about the intricate details of the process.
How Does Query Optimization Work?
The functioning of a query optimizer can be broadly divided into three main steps, which can be illustrated using the following example:
Suppose we have a database containing information about customers and their orders. We want to retrieve the details of customers who placed orders within the last 30 days.
-
Enumerate alternative execution plans: The optimizer generates multiple execution plans that can produce the desired query results. In this case, possible plans could include: a. Scanning the entire orders table, filtering for orders within the last 30 days, and then joining the results with the customers table. b. Using an index on the order date to quickly identify relevant orders, and then joining the results with the customers table. c. First filtering the customers table using any available criteria, and then joining the filtered customer data with the orders table. The optimizer considers factors such as table structure, available indexes, and system resources when generating these plans.
-
Estimate the cost of each execution plan: The optimizer evaluates the cost of each execution plan based on statistical information and a cost model. In this case, cost estimation might include factors such as:
-
a. Execution time: How long each plan is expected to take.
-
b. CPU usage: The amount of processing power required for each plan.
-
c. IO operations: The number of disk reads and writes needed for each plan.
-
d. Network resources: The amount of data transfer required between different nodes for distributed systems.
-
-
Choose the optimal execution plan: The optimizer selects the execution plan with the lowest estimated cost among the enumerated plans, ensuring the most efficient query execution possible. In this example, it might choose the plan that uses an index on the order date (plan b) if it is determined to be the most efficient option based on the cost estimations.
Rule-Based Optimizer (RBO) and Cost-Based Optimizer (CBO)
-
Rule-Based Optimizer (RBO): An RBO can be likened to a traditional paper map for navigation, which relies on a fixed set of routes to guide you to your destination, without considering current traffic conditions or other factors. This method can be effective when the routes are accurate and up-to-date, but it is less flexible and may not account for unexpected changes, such as road closures, construction, or heavy traffic, which can lead to suboptimal routes in some cases.
-
Cost-Based Optimizer (CBO): A CBO can be compared to a modern GPS navigation system that uses real-time traffic information, road conditions, and various route options to estimate the time and cost of each route. The navigation system selects the most efficient route, ensuring a faster and smoother journey. This approach requires accurate traffic information and a reasonable estimation model to function effectively, similar to how a CBO needs accurate statistical information and a reasonable cost model to optimize query execution.
In simpler terms, RBO is a strict optimizer that follows rules without considering data changes. It may not generate the best plan when data changes. CBO, however, relies on statistics and a cost model to find the best plan. The quality of statistics and the cost model affect CBO's ability to pick the best plan.
The cost model is the main algorithm, while statistics calculate the cost. They work together for cost calculation. If statistics are missing, the cost model uses default values, which may differ greatly from the actual values. This can lead to choosing a less efficient plan. So, accurate statistics are essential for the CBO model to work well. Without precise statistics, query optimization may not be effective.
Top Challenges Faced by Query Optimization
The query optimizer is the core module of relational database systems and is the focus and difficulty of database kernel development. It also serves as a "touchstone" for measuring the maturity of the entire database system. Although the query optimization theory has been around for over forty years, and the academic and industrial communities have developed a relatively complete query optimization framework (such as System-R's Bottom-up optimization framework and Volcano/Cascade's Top-down optimization framework), the core challenge surrounding query optimization remains the same—how to use limited system resources to choose the best possible execution plan for a query.
-
Challenge 1: Accurate statistical information and cost models Statistical information and cost models are the basic modules of the query optimizer, primarily responsible for calculating the cost of execution plans. Accurate statistical information and cost models have always been a challenging problem for database systems to solve, mainly for the following reasons:
-
Statistical information: There are two main challenges with collecting statistical information in database systems. First, if the information is collected through sampling, there will inevitably be sampling errors. Second, the collection of statistical information has a certain delay, meaning that when optimizing an SQL query, the optimizer often has to use statistical information from a previous point in time.
-
Selectivity calculation and intermediate result estimation: Selectivity calculation has always been a challenge in database systems. The academic and industry have been researching methods to improve the accuracy of selectivity calculation, such as dynamic sampling, multi-column histograms, etc. However, there are still challenges that persist. Take calculating the selectivity of JOIN predicates as an example, there is still no satisfactory solution.
-
-
Challenge 2: Massive plan space The plan space for complex queries is enormous. In many scenarios, the optimizer may not even be able to enumerate all equivalent execution plans. Efficiently enumerating execution plans in such a vast plan space has always been a challenge for the query optimizer.
StarRocks' Cost Based Optimizer
StarRocks' CBO employs the Cascades framework, uses various statistical information to estimate costs, and incorporates Transformation and Implementation rules. This enables the CBO to select the optimal execution plan with the lowest cost among tens of thousands of possible execution plans in the search space.
The CBO optimizer utilizes a variety of statistics that StarRocks collects periodically, including the number of rows, average column size, cardinality information, the amount of NULL-valued data, and MAX/MIN value.
StarRocks supports full or sampled statistics collection, either manually or periodically, helping the CBO optimizer refine cost estimation and select the optimal execution plan.
StarRocks features an intelligent Cost-Based Optimizer which ensures efficient CPU utilization, particularly during the processing of complex queries. This is achieved through its branched execution planning that uses cascading logic, enabling multiple parts of a query to be processed simultaneously without waiting for others to complete. The query execution plan is automatically optimized, becoming more effective by utilizing real statistics from your data and queries.
Lastly, StarRocks delivers extremely fast performance on object storage like S3. When accessed through data lakes such as Lake House, Hive, Hudi, and Iceberg, the statistical information provided by the data lake module also informs StarRocks' intelligent cost-based optimizer without the need to load data into StarRocks.