Query execution is the process by which a database management system (DBMS) processes a SQL query and retrieves the requested data. The DBMS performs several internal operations, including parsing, optimization, and the execution of a physical plan, which is ultimately responsible for accessing, filtering, and returning the correct dataset.
Efficient query execution ensures that data retrieval is fast and resource usage is optimized, which is especially important for large-scale analytical queries. Modern database systems, such as StarRocks, handle complex queries by breaking the execution process into several key components. These components work together to ensure that each query is parsed, analyzed, optimized, and executed in a way that minimizes resource consumption and delivers results quickly.
Several components are involved in this process, from parsing and analysis through optimization to distributed execution.
To better illustrate the query execution process, we will use StarRocks, an MPP (Massively Parallel Processing) database system, as an example. StarRocks handles complex SQL queries by breaking them down into multiple stages, including query parsing, optimization, scheduling, and execution across distributed compute nodes. Each stage involves key components working together to ensure efficient and optimized execution.
Below is a step-by-step explanation of how StarRocks handles query execution for a SQL query, illustrating its high-performance capabilities.
In StarRocks, the first step in query execution is SQL Parsing. The input to the parsing process is the SQL query string, and the output is an Abstract Syntax Tree (AST). The AST is a hierarchical structure where each node is a ParseNode. For example, when you submit a query, the system generates an object known as a QueryStmt, which includes elements such as SelectList, FromClause, WherePredicate, GroupByClause, HavingPredicate, OrderByElement, and LimitElement. Each element corresponds closely to a component of the original SQL text.

Currently, StarRocks uses ANTLR4 as its parser generator, with the grammar rules defined in the StarRocks.g4 file.
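The shape of such an AST can be sketched with plain data classes. The class and field names below mirror the names mentioned above (QueryStmt, SelectList, FromClause, and so on) but are purely illustrative; StarRocks' real ParseNode hierarchy is written in Java and is far richer.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch of the parse-tree shape described above;
# not StarRocks' actual classes.
@dataclass
class SelectList:
    items: List[str]

@dataclass
class FromClause:
    tables: List[str]

@dataclass
class QueryStmt:
    select_list: SelectList
    from_clause: FromClause
    where_predicate: Optional[str] = None
    group_by: List[str] = field(default_factory=list)
    limit: Optional[int] = None

# A hand-built AST for:
#   SELECT city, SUM(amount) FROM orders
#   WHERE amount > 0 GROUP BY city LIMIT 10
stmt = QueryStmt(
    select_list=SelectList(["city", "SUM(amount)"]),
    from_clause=FromClause(["orders"]),
    where_predicate="amount > 0",
    group_by=["city"],
    limit=10,
)
print(stmt.select_list.items)
```

Each field of the statement object maps directly back to a clause of the SQL text, which is exactly the property the parser is after.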
Once StarRocks has obtained the AST, it proceeds to SQL Analysis, where both syntactic and semantic checks are performed. During this phase, StarRocks performs the following tasks:

- Legality checks: expressions such as GROUPING are not allowed in the WHERE clause, and columns using HLL and Bitmap types cannot be summed.
- Type checks: the parameter of the SUM function must be numeric, and the second and third parameters of the LEAD and LAG window functions must be constants.
- Implicit casts: when comparing BIGINT with DECIMAL, the system automatically casts the BIGINT type to DECIMAL for the comparison to be valid.

The result of this analysis is a hierarchical structure known as a Relation Tree. For example, a FROM clause will correspond to a TableRelation, while a subquery will generate a SubqueryRelation.
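The implicit-cast step can be sketched as a small coercion table consulted during analysis. The rule table below is hypothetical and covers only the BIGINT/DECIMAL case mentioned above; a real analyzer maintains a full cast matrix across all types.

```python
from decimal import Decimal

# Toy semantic-analysis step: when comparing a BIGINT with a DECIMAL,
# widen the integer side to DECIMAL before comparing. The rule table
# is invented for illustration; StarRocks' real cast rules are richer.
COERCIONS = {("BIGINT", "DECIMAL"): "DECIMAL",
             ("DECIMAL", "BIGINT"): "DECIMAL"}

def coerce(left_type, right_type):
    """Return the common type for a comparison, or raise."""
    if left_type == right_type:
        return left_type
    try:
        return COERCIONS[(left_type, right_type)]
    except KeyError:
        raise TypeError(f"no implicit cast between {left_type} and {right_type}")

def compare(left, left_type, right, right_type):
    target = coerce(left_type, right_type)
    if target == "DECIMAL":
        left, right = Decimal(left), Decimal(right)
    return left == right

print(coerce("BIGINT", "DECIMAL"))             # DECIMAL
print(compare(3, "BIGINT", "3.0", "DECIMAL"))  # True
```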
After the SQL analysis, StarRocks converts the Relation Tree into a Logical Plan Tree. Each set of operations in the query corresponds to a logical node within this tree. For example, a query that performs a join and then filters data will have corresponding logical nodes for these operations. The Logical Plan provides an abstract view of the data processing operations but does not yet consider how these operations will be executed physically.
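A logical plan for a join-then-filter query can be pictured as a small tree of nodes. The node and helper names below are illustrative, not StarRocks' internal operator names.

```python
from dataclasses import dataclass
from typing import List

# Minimal sketch of a logical plan tree; names are illustrative.
@dataclass
class LogicalNode:
    name: str
    children: List["LogicalNode"]

def scan(table):       return LogicalNode(f"Scan({table})", [])
def join(l, r, cond):  return LogicalNode(f"Join({cond})", [l, r])
def filt(child, pred): return LogicalNode(f"Filter({pred})", [child])

# Logical plan for:
#   SELECT * FROM t1 JOIN t2 ON t1.id = t2.id WHERE t1.v > 10
plan = filt(join(scan("t1"), scan("t2"), "t1.id = t2.id"), "t1.v > 10")

def render(node, depth=0):
    """Return the tree as indented lines, one per node."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

print("\n".join(render(plan)))
```

Note that the tree says nothing about how the join is executed or where the data lives; that is decided later by the optimizer and scheduler.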
The StarRocks Optimizer takes the Logical Plan Tree as input and produces a Distributed Physical Execution Plan with the lowest possible cost. The importance of the optimizer grows with the complexity of the query, especially when multiple tables are involved, and large datasets need to be processed. The performance of different execution strategies can vary by orders of magnitude.
StarRocks’ optimizer is entirely custom-built, incorporating ideas from the Cascades and ORCA optimization papers. It has been deeply customized for StarRocks’ execution engine and scheduler, resulting in several performance enhancements and innovations. StarRocks fully supports all 99 TPC-DS benchmark queries and implements optimizations such as Common Expression Reuse, Subquery Rewrites, Lateral Joins, CTE Reuse, Join Reordering, Distributed Join Strategy Selection, Global Runtime Filter Pushdown, and Low Cardinality Dictionary Optimization.
Before entering the Cost-Based Optimizer (CBO) phase, StarRocks applies a series of rewrite rules to the logical plan to produce a more efficient logical structure. Key rewrite rules include:
- Pushing down LIMIT clauses to reduce the number of rows processed.
- UNION, INTERSECT, and EXCEPT pruning: removing unnecessary set operations.
- Rewriting COUNT(DISTINCT) aggregations.

After the logical plan rewrite, StarRocks applies the Cost-Based Optimizer (CBO), which is based on the Columbia research paper. The CBO performs further optimizations such as:
- Multi-phase aggregation: for aggregate functions such as COUNT, SUM, MAX, and MIN, StarRocks can split the aggregation into two phases; COUNT(DISTINCT) queries can be split into three or four phases.

During the CBO phase, the logical plan is transformed into a Memo data structure, which records both logical and physical execution plans. The Memo serves as the search space for the optimizer to evaluate multiple strategies, expanding this space with various transformation rules. Finally, StarRocks selects the execution plan with the lowest estimated cost based on statistics and cost models.
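The two-phase split can be sketched for SUM: phase one computes partial sums on each node over its local rows, and phase two merges the partials. This is a minimal sketch of the general technique, not StarRocks' implementation.

```python
from collections import defaultdict

# Illustrative two-phase SUM: phase 1 computes partial sums per
# (simulated) BE node; phase 2 merges the partials on one node.
def partial_sum(rows):
    """Phase 1: local pre-aggregation on a single node."""
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    return dict(acc)

def merge_sums(partials):
    """Phase 2: merge the per-node partial results."""
    final = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            final[key] += value
    return dict(final)

node1 = [("us", 3), ("eu", 5), ("us", 2)]   # rows held by BE node 1
node2 = [("eu", 1), ("us", 4)]              # rows held by BE node 2
result = merge_sums([partial_sum(node1), partial_sum(node2)])
print(result)  # {'us': 9, 'eu': 6}
```

The benefit is that each node ships only one row per group key across the network instead of every input row, which is why the split pays off for high-cardinality aggregations.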
The quality of the Cost-Based Optimizer's plan choices largely depends on accurate cost estimation, which in turn relies on timely and accurate statistics collection. StarRocks supports both table-level and column-level statistics and provides options for either automatic or manual collection. Users can choose between full and sampling-based statistics collection methods.
With these statistics, StarRocks estimates the cost of a query by factoring in CPU usage, memory requirements, network bandwidth, and disk I/O. Different operators in the query have different cost formulas, which are adjusted based on the query’s specific characteristics.
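As a toy illustration of cost-based choice, the sketch below scores two join strategies with a weighted sum over resource estimates. The weights and numbers are invented for illustration; StarRocks' real cost model is operator-specific and statistics-driven.

```python
# Hypothetical cost model: a weighted sum of per-resource estimates.
# All weights and usage figures here are invented for illustration.
WEIGHTS = {"cpu": 1.0, "memory": 0.5, "network": 2.0, "io": 1.5}

def operator_cost(cpu, memory, network, io):
    usage = {"cpu": cpu, "memory": memory, "network": network, "io": io}
    return sum(WEIGHTS[r] * usage[r] for r in WEIGHTS)

# A broadcast join ships the build side to every node (network heavy),
# while a shuffle join repartitions both sides.
broadcast = operator_cost(cpu=100, memory=50, network=400, io=0)
shuffle   = operator_cost(cpu=120, memory=30, network=150, io=0)
best = "shuffle join" if shuffle < broadcast else "broadcast join"
print(broadcast, shuffle, best)  # 925.0 435.0 shuffle join
```

With different cardinality estimates (for example, a tiny build side), the same formula would flip the decision, which is why stale statistics can lead directly to bad join strategies.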
When query performance issues arise—such as inefficient join orders or incorrect join distribution strategies—users can refer to StarRocks’ CBO documentation to guide them through collecting and using statistics to improve query performance.
The Plan Fragment Generator takes the optimized physical plan and breaks it down into Plan Fragments. These fragments represent smaller, self-contained units of the plan that can be executed independently and in parallel across different compute nodes in the cluster.
In StarRocks, the system is divided into two main components: Frontend (FE) and Backend (BE). The FE is responsible for query parsing, optimization, task scheduling, and metadata management. It receives SQL queries, generates optimized execution plans, and distributes query fragments to the BE nodes. The BE nodes are responsible for executing these fragments, handling data processing tasks such as scanning, filtering, joining, and aggregating data in a distributed and parallelized manner.
Once the distributed plan has been generated, the Frontend (FE) is responsible for scheduling the execution. This involves generating execution instances for the plan fragments, assigning fragments to nodes, managing the execution state on each BE, and receiving the final query results.
Key challenges at this stage include deciding which BE nodes each Plan Fragment instance should run on and balancing the scan load evenly across the cluster.
StarRocks first confirms where the Scan Operator will execute by assigning it to specific BE nodes based on the tablet list it needs to scan. For each tablet, StarRocks selects a healthy and available replica that matches the desired version. When multiple replicas are available, StarRocks uses a randomized selection process but strives to balance requests across BE nodes. For example, if there are 10 BE nodes and 10 tablets, each BE is expected to scan one tablet.
Once the Scan fragments are assigned to BE nodes, other Plan Fragment instances are also scheduled on the same or other randomly selected BE nodes, depending on parameters set by the user.
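The balancing behaviour described above can be approximated with a greedy least-loaded choice among each tablet's healthy replicas. The function and data below are illustrative, not the FE's actual algorithm (which also randomizes among candidates).

```python
# Illustrative replica selection: for each tablet, pick the replica
# whose BE node currently has the fewest assigned scans. A sketch of
# the balancing goal described above, not StarRocks' actual policy.
def assign_tablets(tablet_replicas, be_nodes):
    load = {be: 0 for be in be_nodes}   # scans assigned per BE
    assignment = {}
    for tablet, replicas in tablet_replicas.items():
        healthy = [be for be in replicas if be in load]
        chosen = min(healthy, key=lambda be: load[be])
        assignment[tablet] = chosen
        load[chosen] += 1
    return assignment, load

replicas = {
    "t1": ["be1", "be2"],
    "t2": ["be2", "be3"],
    "t3": ["be1", "be3"],
    "t4": ["be1", "be2"],
}
assignment, load = assign_tablets(replicas, ["be1", "be2", "be3"])
print(assignment, load)
```

With 4 tablets and 3 BE nodes, no node ends up with more than 2 scan tasks, which mirrors the "10 BE nodes, 10 tablets, one each" intuition from the text.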
After the FE determines which Plan Fragments will run on which BE nodes, it sends execution parameters to the BE using Thrift.
Currently, the FE schedules multiple Plan Fragments all at once, traversing the Plan Fragment tree from top to bottom and sending each Plan Fragment’s execution information to the corresponding BE nodes.
StarRocks employs a Massively Parallel Processing (MPP) architecture to utilize multiple machines in parallel for maximum query performance. Additionally, it uses Pipeline Execution for efficient use of multicore processors and vectorized execution to fully leverage individual CPU cores.
MPP, or Massively Parallel Processing, is a technique where the query plan is divided into many smaller instances that can be executed on individual nodes. None of the nodes share CPU, memory, or disk resources, and the performance of MPP queries can scale linearly as the cluster expands.
As shown in the diagram, StarRocks logically splits a query into multiple Query Fragments, each of which can have one or more execution instances. These fragments are dispatched to different BE nodes for parallel execution. Each fragment may include multiple operators, such as Scan, Filter, and Aggregate. Fragments are executed with varying degrees of parallelism, depending on the query complexity and available resources.
Multiple fragments in StarRocks are executed in parallel in memory through a Pipeline mechanism, rather than being processed stage by stage as in batch processing engines. The Shuffle operation, which redistributes data across nodes, is crucial for scaling query performance horizontally with the cluster. It is also key to efficiently handling high-cardinality aggregations and large table joins.
StarRocks introduces the concept of Pipeline Execution between fragments and operators. Within a pipeline, data is processed continuously without the need for materialization at each step. However, if an operator requires materialization (e.g., an AGG, SORT, or JOIN operator), a new pipeline is created. Therefore, a single fragment may correspond to multiple pipelines.
Each pipeline consists of several operators. The first operator is typically a Source Operator (e.g., a Scan or Exchange node), which generates the data. The last operator is a Sink Operator, which materializes or consumes the data. Operators between the Source and Sink nodes are responsible for transforming the data.
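A pipeline of this shape can be sketched with Python generators: the source yields chunks, a transform operator processes them lazily, and the sink consumes them. The laziness mimics "no materialization between operators"; the operator names are illustrative.

```python
# Sketch of a source -> transform -> sink pipeline using generators.
# Chunks flow through lazily, with nothing materialized between
# operators until the sink; names are illustrative.
def scan_source(chunks):
    """Source operator: produces chunks of rows."""
    for chunk in chunks:
        yield chunk

def filter_op(upstream, pred):
    """Transform operator: filters each chunk as it passes through."""
    for chunk in upstream:
        yield [row for row in chunk if pred(row)]

def sum_sink(upstream):
    """Sink operator: consumes the stream and materializes a result."""
    total = 0
    for chunk in upstream:
        total += sum(chunk)
    return total

chunks = [[1, -2, 3], [4, -5], [6]]
pipeline = filter_op(scan_source(chunks), lambda x: x > 0)
print(sum_sink(pipeline))  # 14
```

A blocking operator such as a sort would have to consume its entire input before emitting anything, which is exactly why such operators terminate one pipeline and start another.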
The core of StarRocks' pipeline parallel execution framework is based on user-space coroutine scheduling, which minimizes the overhead associated with creating, destroying, and switching threads in the operating system. In this framework, StarRocks runs a number of execution threads equivalent to the number of CPU cores, with each thread retrieving ready pipelines from a multi-level feedback queue. Additionally, a global Poller thread constantly checks blocked pipelines to see if they can be moved back to the ready queue for execution once the blockage is resolved.
As the performance bottleneck in modern databases shifts from I/O to CPU, StarRocks has implemented a fully vectorized execution engine to take full advantage of modern CPU architectures. Vectorization in StarRocks refers to executing operators and expressions in batches of columns rather than row-by-row. This method significantly improves performance by minimizing function calls, reducing branch mispredictions, and improving CPU cache utilization.
By processing columns in batches, StarRocks achieves greater efficiency, particularly in CPU-bound workloads. It also enables the use of SIMD (Single Instruction, Multiple Data) instructions, allowing a single CPU instruction to operate on multiple data points simultaneously.
Vectorized execution is more than just processing operators and expressions in a columnar fashion—it represents a comprehensive performance optimization effort that includes redesigning data structures and algorithms, optimizing memory management, leveraging SIMD instructions, and refining CPU cache usage. As a result, StarRocks has achieved performance improvements of 5 to 10 times compared to the previous row-based execution model.
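The columnar idea can be illustrated in miniature: evaluate an expression over whole columns at a time rather than row by row. This is only a sketch of the batching pattern; a real engine does this in C++ over fixed-size column chunks so the compiler can emit SIMD instructions.

```python
from array import array

# Columnar evaluation of: SELECT SUM(price * qty) WHERE qty > 0.
# Each step touches one whole column (a batch) at a time, mirroring
# vectorized execution; the data and query are invented examples.
price = array("d", [9.5, 3.0, 7.25, 1.0])
qty   = array("l", [2, 0, 4, 3])

def vectorized_revenue(price_col, qty_col):
    mask = [q > 0 for q in qty_col]                    # one pass: filter column
    products = [p * q for p, q in zip(price_col, qty_col)]  # one pass: multiply
    return sum(v for v, keep in zip(products, mask) if keep)

print(vectorized_revenue(price, qty))  # 51.0
```

Because each loop body is a single simple operation over contiguous data, branches are predictable and caches are used well, which is the source of the speedups described above.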
Query execution is a fundamental process in database management, responsible for translating SQL queries into actionable steps that retrieve and process data efficiently. In modern systems like StarRocks, this process is highly optimized and involves several key stages: parsing, analysis, optimization, and execution. StarRocks exemplifies how a database system can handle complex, large-scale queries through its use of advanced techniques such as Cost-Based Optimization (CBO), distributed execution, pipeline parallelism, and vectorized execution. These methods work together to ensure that queries are executed with minimal resource consumption and maximum performance, particularly in multi-node, high-performance environments. By breaking queries into smaller, parallelizable tasks, leveraging modern CPU capabilities, and efficiently managing resources across nodes, StarRocks demonstrates the importance of well-executed query processing in enabling fast, scalable, and accurate data retrieval.