Heuristic Approach for the Bushy Tree Implementation on Apache Derby Query Optimizer


Query optimization is one of the most important steps in relational databases. Almost all the relational database systems such as MySQL, Postgres, Oracle DB, Apache Derbyconsist of query optimizers. Major challenge in query optimization is deciding optimum plan to execute join query. According to some researches that has been conducted related to this problem, left-deep tree and bushy trees are two main approaches to solve the problem(Chaudhuri). Apache Derby is a relational database entirely written in java. Currently Apache Derby optimizer only consider left-deep tree in its query processing. Our approach is to implement bushy tree support for derby optimizer.However research has shown that constructing bushy tree execution plans is NP hard(G Moerkotte 1996). Therefore we are introducing heuristics that would minimize evaluations time significantly

Introduccton To Apache Derby

Derby Architecture

Before going in to the details of Derby Optimizer lets look at the Apache Derby architecture. If we consider the module view of Derby architecture it is a system comprised of a monitor and a collection of modules. The monitor is code that maps module requests to implementations based upon the request and the environment. For example with JDK 1.3 the internal request for a JDBC driver the monitor selects Derby’s JDBC 2.0 implementation, while in JDK 1.4 the driver is the JDBC 3.0 implementation. This allows Derby to present a single JDBC driver to the application regardless of JDK and internally the correct driver is loaded.

A module in Derby is a set of discrete functionality, such as a lock manager, JDBC driver, indexing method etc. A module’s interface is typically defined by a set of Java interfaces. For example the java.sql interfaces define a interface for a JDBC driver. All callers of a module do so purely through its interface to separate api from implementation. A module’s implementation is a set of classes that implement the required behavior and interfaces. Thus a module implementation can change or be replaced with a different implementation without affecting the callers’ code. Modules are either system wide (shared) or per-service with a service corresponding to a database.

This architecture allows different modules to be loaded depending on the environment and in the past also supported different product configurations out of the same code base.If we consider the layer/box view of the Derby architecture, there are four main code areas that can be identified. They are JDBC, SQL, Store and Services.

JDBC presents the only api to Derby to applications and consists of implementations of the java.sql and javax.sql classes for JDBC 2.0 and 3.0. Applications use Derby solely through its implementations of the top-level JDBC interfaces (Driver, DataSource, ConnectionPoolDataSource and XADataSource) and the remaining JDBC interfaces. The JDBC layer sits on top of the SQL layer. The SQL layer is split into two main logical areas, compilation and execution. SQL compilation is a five step process:

  1. Parse using a parser generated by Javacc, results in a tree of query nodes

  2. Bind to resolve all objects (e.g. table names)

  3. Optimize to determine the best access path

  4. Generation of a Java class (directly to byte code) to represent the statement plan

  5. Loading of the class and creation of an instance to represent that connection’s state of the query

The generated statement plan is cached and can be shared by multiple connections. DDL statements (e.g. CREATE TABLE) use a common statement plan to avoid generation of a Java class file. This implementation was driven by the original goal to have a small footprint. Using the JVM’s interpreter was thought to be less code than having an internal one. It does mean that the first couple of times the statement plan is executed,it would be interpreted. After a number of executions, the Just-In-Time (JIT) compiler will decide to compile it into native code. Thus, running performance tests will see a boost after a number of iterations. In addition, calling into Java user-supplied methods (functions and procedures) is direct, rather than through reflection.

SQL Execution is calling execute methods on the instance of the generated class that return a result set object. This result set is a Derby ResultSet class, not a JDBC one. The JDBC layer presents the Derby ResultSet as a JDBC one to the application. For a simple table scan the query would consist of a single result set object representing the table scan. For a more complex query the top-level result set ”hides” a tree of result sets that correspond to the correct query. DML (INSERT/ UPDATE/ DELETE) are handled the same way, with a ResultSet that performs all of its work in its open method and returns an update count. These result set objects interface with the Store layer to fetch rows from tables, indexes or perform sorts.

The Store layer of Derby Architecture splits into two main areas, access and raw. The access layer presents a conglomerate (table or index)/row based interface to the SQL layer. It handles table scans, index scans, index lookups, indexing, sorting, locking policies, transactions, isolation levels. The access layer sits on top of the raw store which provides the raw storage of rows in pages in files, transaction logging, transaction management. JCE encryption is plugged in here at the page level. The raw store works with a pluggable file system api that allows the data files to be stored in the Java filesystem, jar files, jar files in the classpath, or any other mechanism.

Services of Derby architecture are utility modules such as lock management, cache management, error logging etc. The services/cache component is a general purpose caching mechanism that simply caches objects that implement the cacheable interface. It is used by the store for buffer caching, but it is also used to cache compiled plans for SQL statements, open containers and their file descriptors, string conversions, table descriptors and maybe more. The interface/implementation split ensures that the cache algorithm is not known by the users of the cache and can be changed with affecting them. The current algorithm is a ”clock or ring” based algorithm.

Derby Optimizer

After introducing the Derby architecture now we can present the Derby optimizer which is used by the SQL compilation under the SQL layer. The optimizer currently only considers left-deep trees. That is, when looking at joins, it does not consider join nodes to the right of other join nodes. The join trees it considers are skewed to the left. Bushy trees where the join tree can take any shape are harder to implement but are often useful for big multi-way decision support joins.

During optimization the join order is represented as an array of Optimizables. The first element in the array is the leftmost table in the join tree, and the successive elements in the array are the joining tables in the left-deep tree. The optimizer does an exhaustive search of query plans, going through all the join orders and, for each join order, all of the access paths for that order. It remembers the cheapest complete plan it has found so far, and does cost-based search space pruning. That is, it does not consider plans that are more expensive that the best plan found so far. The optimizer goes through the search space depth-first. In other words, it adds a table to the join order, then goes through all the access paths for that table (choosing the cheapest one) before going on to the next position in the join order. When it gets all the way to the end of the join order, with all of the tables accounted for, it considers whether the plan it is looking at is the best one found so far. If so, it remembers it. Eventually, it exhausts all the tables to consider at a given depth in the join order, and it has to back up to the previous position in the join order and consider the next table there.

Every time the optimizer adds a table to the prospective join order, it figures out which parts of the WHERE clause can be evaluated at that position in the order and pushes these expressions down to that position so they can be used there. For example, if you have a five-way join of tables t1, t2, t3, t4 and t5, and the current permutation being considered is t3 - t1 - t2 (with t3 as the outermost table and t2 as the innermost), it can evaluate the join ”t3.c1 = t2.c2” at the third position in the join order, so when it adds table t2 it pushes the expression down to that level in the order. Later, when it removes t2 from the join order to consider some other table there, it pulls the expression back out of the join order.

getNextPermutation() does the adding (and deletion) of tables to the join order under consideration by the optimizer. getNextDecoratedPermutation() goes through all the access paths for the current permutation (in the current implementation, it only has to consider the access paths for the innermost table in the join order, as the search is done depth-first). You can think of a decorated permutation as a join order with things like indexes and join strategies ”decorating” the nodes. The Optimizable interface represents anything in the query that can have an access path. In practice an Optimizable is usually a base table, but it can be anything that appears in a FROM list (for example, standard SQL allows a UNION in the FROM list). Different Optimizables have different choices for access paths. The optimizer uses the nextAccessPath() method on Optimizable to step through the different access paths. These include different join strategies (such as nested loop and hash join) and different conglomerates (the base table and the different indexes). Sometimes the Optimizable has to decide whether a given access path is feasible (for example, hash join is only feasible for equatoins)