Intro
In a Database System, for a given query, find a correct execution plan that has the lowest 'cost'.
The cost is a relative measure and is difficult to calculate.
The query optimizer is the part of a DBMS that is the hardest to implement well.
The query optimization problem faced by everyday query optimizers gets more and more complex with the ever increasing complexity of user queries. The NP-hard join ordering problem is a central problem that an optimizer must deal with in order to produce optimal plans.
No optimizer truly produces the 'optimum' plan:
a) use estimation techniques to guess real plan cost
b) use heuristics to limit the search space.
"If the best query plan needs 5 seconds to execute but we need 20 seconds to find it, it is a bad trade-off"
Processing of an SQL query
A SQL statement is subjected to four phases of processing. Depending on the origin and contents of the statement, these phases may be separated by arbitrary intervals of time. An overview of the processing steps relevant to access path selection (what we generally refer to as physical plan, in contrast to the logical plan) will be presented in this section.
The four phases of statement processing (in System R) are parsing, optimization, code generation and, execution.
Each SQL statement is sent to the Parser where it is checked for correct syntax. A query block is represented by a SELECT list, a FROM list, and a WHERE tree, containing, respectively the list of items to be retrieved, the tables referenced, and the boolean combination of predicated specified by the user.
If the Parser returns without any errors detected, the Optimizer component is called. The Optimizer accumulates the names of tables and columns referenced in the query and looks them up in the Syntax Catalog to verify their existence and to retrieve information about them.
The catalog lookup portion of the Optimizer also contains statistics about the referenced relations, and the access paths available on each of them. These will be used later in access path selection. After the catalog lookup the Optimizer rescans the select list and where tree to check for semantic errors and type compatibility in both expressions and predicate comparisons.
Finally the optimizer performs access path selection. It first determines the evaluation order among query blocks in the statement. Then for each block, the relations in the From list are processed. If there is more than one relation in the block, permutations of the join order are evaluated. The access paths that minimize total cost for the block are chosen from a tree of alternate path choices. This minimum cost solution is represented by a structural modification of the parse tree. The result in an execution plan in the Access Specification Language (ASL).
After a plan is chosen for each query block represented in the parse tree, the Code Generator is called. This is a table-driven program which translated ASL trees into machine language code to execute the plan chosen. In doing this it uses a relatively small number of code templates, one for each type of join method. Query blocks for nested queries are treated as "subroutines" which return values to the predicates in which they occur. During code generation, the parse tree is replaced by executable machine code and its associated data structures.
"An index scan need not scan the entire relation. Starting and stopping key values may be specified in order to scan only those tuples which have a key in a range of index values. Both index and segment scans may optionally take a set of predicates, called search arguments (or SARGS), which are applied to a tuple before it is returned to the caller (RSI). If the tuple satisfies the predicates, it is returned; otherwise the scan continues until it either finds a tuple which satisfies the SARGS or exhausts the segment or the specified index value range. This reduces cost by eliminating the overhead of making RSI calls for tuples which can be efficiently rejected. Not all predicated are of the form that can become SARGS. A sargable predicate is one of the form "column-comparison operator value". SARGS are expressed as boolean expression of such predicates in disjunctive normal form".
"... We now consider the order in which the relations are chosen to be joined. It should be noted that although the cardinality of the join of n relations is the same regardless of join order, the cost of joining in different order can be substantially different. If a query block has n relations in its From list, then there are n! permutations of relation join orders. The search space can be reduced by observing that once the first k relations are joined, the method to join the composite to the k+1-st relation is independent of the order of joining the first k; i.e. the applicable predicates are the same, the set of interesting orderings is the same, the possible join methods are the same, etc. Using this property, an efficient way to organize the search is to find the best join order for successively larger subsets of tables."
Concluding the Access Path Selection in an RDBMS (System-R):
The System-R access path selection preliminary results indicate that the cost of path selection is not overwhelming. For a 2-way join, the cost of optimization is approximately equivalent to between 5 and 20 database retrievals. Additionally, the cost of optimization is amortized over many runs.
Andy Pavlo breaks things up slightly differently:
1) Parser -> (produces) Abstract Syntax Tree (AST)
2) Binder -> Annotated AST
3) Optimizer -> Physical Plan
4) Physical Plan generation
Logical vs Physical Plans:
The optimizer generates a mapping of a logical algebra expression to the optimal equivalent physical algebra expression. Physical operators define a specific execution strategy using a particular access path. They can depend on the physical format of the data they process. There is not always a one-to-one mapping from logical to physical.
Cost Estimation:
Its purpose is to generate an estimate of the cost of executing a plan for the current state of the database. It should take into consideration the following:
1) Interactions with other work currently running on the DBMS
2) The size of the intermediate results
3) Choices of algorithms, access methods
4) Resource utilization (CPU, I/O, Network)
5) Data properties (skew, order, placement)
Optimizer Design Choices
Optimization Granularity
- Single Query
- Multiple Queries
Optimization Timing
Plan Stability (not major fluctuations in execution time)
- Hints
- Fixed Optimizer versions
- Backwards-compatible plans
Several approaches have been proposed to enumerate over the search space of equivalent plans for a given user query. These include randomized solutions such as Iterative Improvement and Simulated Annealing
Optimization Search Strategies
The Principle of Optimality
To use dynamic programming, the problem must observe the principle of optimality, that whatever the initial state is, remaining decisions must be optimal with regard the state following from the first decision.