Bhathiya edited Section_Apache_Derby_Architecture_Before__.tex  about 8 years ago

Commit id: af5e2eea95179199b266722d345722b8195abf81

\begin{enumerate}
  \item Parse, using a parser generated by JavaCC, producing a tree of query nodes
  \item Bind, to resolve all objects (e.g. table names)
  \item Optimize, to determine the best access path
  \item Generate a Java class (directly to byte code) to represent the statement plan
  \item Load the class and create an instance to represent that connection's state of the query
\end{enumerate}

The generated statement plan is cached and can be shared by multiple connections. DDL statements (e.g. CREATE TABLE) use a common statement plan to avoid generating a Java class file. This implementation was driven by the original goal of a small footprint: using the JVM's interpreter was thought to require less code than implementing an internal one. It does mean that the first couple of times the statement plan is executed, it is interpreted. After a number of executions, the Just-In-Time (JIT) compiler decides to compile it into native code; thus, performance tests see a boost after a number of iterations. In addition, calls into user-supplied Java methods (functions and procedures) are direct rather than through reflection.

SQL execution consists of calling execute methods on the instance of the generated class, which return a result set object. This result set is a Derby ResultSet class, not a JDBC one; the JDBC layer presents the Derby ResultSet to the application as a JDBC one. For a simple table scan, the query consists of a single result set object representing the scan. For a more complex query, the top-level result set ``hides'' a tree of result sets that corresponds to the structure of the query.
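The idea of a top-level result set hiding a tree of result sets can be sketched as follows. This is a minimal, illustrative model only: the interface and class names (RowSource, TableScan, NestedLoopJoin) are invented for this sketch and are not Derby's actual ResultSet hierarchy.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Leaf and interior nodes share one interface, so a single top-level
// node can hide a whole tree of result sets (names invented, not Derby's).
interface RowSource {
    void open();
    Object[] next();   // returns null when exhausted
    void close();
}

// Leaf node: a scan over an in-memory "table".
class TableScan implements RowSource {
    private final List<Object[]> rows;
    private Iterator<Object[]> it;
    TableScan(List<Object[]> rows) { this.rows = rows; }
    public void open() { it = rows.iterator(); }
    public Object[] next() { return (it != null && it.hasNext()) ? it.next() : null; }
    public void close() { it = null; }
}

// Interior node: a nested-loop equijoin whose children are RowSources,
// so a join of joins forms a tree behind one top-level node.
class NestedLoopJoin implements RowSource {
    private final RowSource left, right;
    private final int leftKey, rightKey;   // join column positions
    private Object[] leftRow;
    NestedLoopJoin(RowSource left, RowSource right, int leftKey, int rightKey) {
        this.left = left; this.right = right;
        this.leftKey = leftKey; this.rightKey = rightKey;
    }
    public void open() { left.open(); leftRow = left.next(); right.open(); }
    public Object[] next() {
        while (leftRow != null) {
            Object[] rightRow = right.next();
            if (rightRow == null) {          // inner scan exhausted:
                leftRow = left.next();       // advance the outer row and
                right.close(); right.open(); // restart the inner scan
                continue;
            }
            if (leftRow[leftKey].equals(rightRow[rightKey])) {
                Object[] joined = new Object[leftRow.length + rightRow.length];
                System.arraycopy(leftRow, 0, joined, 0, leftRow.length);
                System.arraycopy(rightRow, 0, joined, leftRow.length, rightRow.length);
                return joined;
            }
        }
        return null;
    }
    public void close() { left.close(); right.close(); }
}

public class ResultSetTreeDemo {
    // Count rows produced by a two-table join tree: the caller only
    // sees the top-level RowSource, not the scans beneath it.
    public static int countJoinedRows() {
        RowSource t1 = new TableScan(Arrays.asList(
                new Object[]{1, "a"}, new Object[]{2, "b"}));
        RowSource t2 = new TableScan(Arrays.asList(
                new Object[]{2, "x"}, new Object[]{3, "y"}));
        RowSource top = new NestedLoopJoin(t1, t2, 0, 0);  // join on column 0
        top.open();
        int n = 0;
        while (top.next() != null) n++;
        top.close();
        return n;
    }
    public static void main(String[] args) {
        System.out.println(countJoinedRows());
    }
}
```

A DML node would follow the same shape, doing all of its work in open() and returning only an update count.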
DML statements (INSERT/UPDATE/DELETE) are handled the same way, with a ResultSet that performs all of its work in its open method and returns an update count. These result set objects interface with the Store layer to fetch rows from tables and indexes or to perform sorts.

The Store layer of the Derby architecture splits into two main areas: access and raw. The access layer presents a conglomerate (table or index)/row based interface to the SQL layer. It handles table scans, index scans, index lookups, indexing, sorting, locking policies, transactions, and isolation levels. The access layer sits on top of the raw store, which provides the raw storage of rows in pages in files, transaction logging, and transaction management. JCE encryption is plugged in here at the page level. The raw store works with a pluggable file system API that allows the data files to be stored in the Java file system, in jar files, in jar files on the classpath, or through any other mechanism.

Services of the Derby architecture are utility modules such as lock management, cache management, and error logging. The services/cache component is a general-purpose caching mechanism that simply caches objects implementing the cacheable interface. It is used by the store for buffer caching, but it is also used to cache compiled plans for SQL statements, open containers and their file descriptors, string conversions, table descriptors, and possibly more. The interface/implementation split ensures that the cache algorithm is not known by the users of the cache and can be changed without affecting them. The current algorithm is a ``clock'' or ``ring'' based algorithm.

After introducing the Derby architecture, we can now present the Derby optimizer, which is used by SQL compilation in the SQL layer. The optimizer currently considers only left-deep trees. That is, when looking at joins, it does not consider join nodes to the right of other join nodes; the join trees it considers are skewed to the left.
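A ``clock'' (second-chance) cache of the kind the services/cache component describes can be sketched as below. The names and structure are illustrative only, a much-simplified stand-in for Derby's cacheable interface and cache manager, assuming a fixed-size ring of slots swept by a clock hand.

```java
import java.util.HashMap;
import java.util.Map;

// Toy "clock"/"ring" cache: a fixed ring of slots and a sweeping hand.
// Illustrative only; Derby's real cacheable/cache-manager interfaces
// are richer than this sketch.
class ClockCache<K, V> {
    private static final class Slot<K, V> {
        final K key; final V value; boolean recentlyUsed;
        Slot(K key, V value) { this.key = key; this.value = value; recentlyUsed = true; }
    }

    private final Slot<K, V>[] ring;        // the fixed "ring" of slots
    private final Map<K, Integer> index = new HashMap<>();
    private int hand = 0;                   // the clock hand

    @SuppressWarnings("unchecked")
    ClockCache(int capacity) { ring = new Slot[capacity]; }

    V get(K key) {
        Integer slot = index.get(key);
        if (slot == null) return null;      // cache miss
        ring[slot].recentlyUsed = true;     // mark for a second chance
        return ring[slot].value;
    }

    // Insert a new entry, evicting the first slot the hand finds whose
    // recently-used bit is clear (clearing set bits as it sweeps past).
    void put(K key, V value) {
        while (ring[hand] != null && ring[hand].recentlyUsed) {
            ring[hand].recentlyUsed = false;
            hand = (hand + 1) % ring.length;
        }
        if (ring[hand] != null) index.remove(ring[hand].key);  // evict
        ring[hand] = new Slot<>(key, value);
        index.put(key, hand);
        hand = (hand + 1) % ring.length;
    }
}
```

The point of the interface/implementation split in Derby is precisely that callers of the cache never see this ring or hand, so the policy can be swapped without affecting them.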
Bushy trees, where the join tree can take any shape, are harder to implement but are often useful for big multi-way decision-support joins.

During optimization the join order is represented as an array of Optimizables. The first element in the array is the leftmost table in the join tree, and the successive elements are the joining tables in the left-deep tree. The optimizer does an exhaustive search of query plans, going through all the join orders and, for each join order, all of the access paths for that order. It remembers the cheapest complete plan it has found so far and does cost-based search-space pruning; that is, it does not consider plans that are more expensive than the best plan found so far.

The optimizer goes through the search space depth-first. In other words, it adds a table to the join order, then goes through all the access paths for that table (choosing the cheapest one) before going on to the next position in the join order. When it gets all the way to the end of the join order, with all of the tables accounted for, it considers whether the plan it is looking at is the best one found so far; if so, it remembers it. Eventually, it exhausts all the tables to consider at a given depth in the join order, and it has to back up to the previous position in the join order and consider the next table there.

Every time the optimizer adds a table to the prospective join order, it figures out which parts of the WHERE clause can be evaluated at that position in the order and pushes those expressions down to that position so they can be used there. For example, if you have a five-way join of tables t1, t2, t3, t4 and t5, and the current permutation being considered is t3 - t1 - t2 (with t3 as the outermost table and t2 as the innermost), the join predicate ``t3.c1 = t2.c2'' can be evaluated at the third position in the join order, so when the optimizer adds table t2 it pushes the expression down to that level in the order.
Later, when it removes t2 from the join order to consider some other table there, it pulls the expression back out of the join order.

getNextPermutation() does the adding (and removal) of tables to the join order under consideration by the optimizer. getNextDecoratedPermutation() goes through all the access paths for the current permutation (in the current implementation, it only has to consider the access paths for the innermost table in the join order, as the search is done depth-first). You can think of a decorated permutation as a join order with things like indexes and join strategies ``decorating'' the nodes.

The Optimizable interface represents anything in the query that can have an access path. In practice an Optimizable is usually a base table, but it can be anything that appears in a FROM list (for example, standard SQL allows a UNION in the FROM list). Different Optimizables have different choices for access paths. The optimizer uses the nextAccessPath() method on Optimizable to step through the different access paths. These include different join strategies (such as nested loop and hash join) and different conglomerates (the base table and the different indexes). Sometimes the Optimizable has to decide whether a given access path is feasible (for example, hash join is only feasible for equijoins).
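The push-down/pull-out bookkeeping described above can be sketched as follows. The class and method names here (PredicateList, addTable, removeTable) are invented for illustration and are not the internals of getNextPermutation(); the sketch only shows the invariant: a predicate is pushed to the first join position at which every table it references is in the order, and pulled back out when the optimizer backtracks past that position.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Illustrative predicate push-down bookkeeping for a depth-first walk
// over join orders (names invented, not Derby's).
class PredicateList {
    static final class Pred {
        final String text;
        final Set<String> tables;           // tables the predicate references
        Pred(String text, String... tables) {
            this.text = text;
            this.tables = new HashSet<>(Arrays.asList(tables));
        }
    }

    private final List<Pred> unpushed = new ArrayList<>();
    private final Deque<List<Pred>> pushedPerPosition = new ArrayDeque<>();
    private final Set<String> tablesInOrder = new HashSet<>();

    PredicateList(List<Pred> preds) { unpushed.addAll(preds); }

    // The optimizer adds `table` at the next (innermost) join position;
    // return the predicates that become evaluable at that position.
    List<String> addTable(String table) {
        tablesInOrder.add(table);
        List<Pred> nowPushed = new ArrayList<>();
        for (Iterator<Pred> it = unpushed.iterator(); it.hasNext(); ) {
            Pred p = it.next();
            if (tablesInOrder.containsAll(p.tables)) {  // all refs resolved
                nowPushed.add(p);                       // push down here
                it.remove();
            }
        }
        pushedPerPosition.push(nowPushed);
        List<String> texts = new ArrayList<>();
        for (Pred p : nowPushed) texts.add(p.text);
        return texts;
    }

    // The optimizer backtracks, removing the innermost table: any
    // predicates pushed at that position are pulled back out.
    void removeTable(String table) {
        tablesInOrder.remove(table);
        unpushed.addAll(pushedPerPosition.pop());
    }
}
```

With tables added in the order t3, t1, t2, the predicate ``t3.c1 = t2.c2'' becomes evaluable only at the third position, and removing t2 on backtrack makes it pending again, matching the t3 - t1 - t2 example above.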