For iterative benchmarks, we insert an OSR point in the body of their hottest loops. We classify a loop as hottest when its body is executed for a very high cumulative number of iterations (e.g., from a few thousand up to billions) and it either calls the method with the highest {\em self} time in the program, or it performs the most computationally intensive operations for the program in its own body. These loops are natural candidates for OSR point insertion, as they can be used - as in the Jikes RVM - to enable more dynamic inlining opportunities, with the benefits from several control-flow (e.g., dead code elimination) and data-flow (e.g., constant propagation) optimizations based on the run-time values of the live variables. In the \shootout\ benchmarks, the number of such loops is typically 1 (2 for {\tt spectral-norm}). For {\tt b-trees} - the only benchmark in our suite showing a recursive pattern - we insert an OSR point in the body of the method that accounts for the largest {\em self} execution time of the program. Such an OSR point might be useful to trigger recompilation of the code at a higher degree of optimization, or to enable some form of dynamic optimization (for instance, in a recursive search algorithm we might want to inline the comparator method provided by the user at the call). Results for the unoptimized and optimized versions of the benchmarks are reported in \myfigure\ref{fig:code-quality-base} and \myfigure\ref{fig:code-quality-O1}, respectively. For both scenarios we observe that the overhead is very small, i.e., less than $1\%$ for most benchmarks and less than $2\%$ in the worst case. For some benchmarks, code might even run slightly faster after OSR points have been inserted, due to cache and instruction alignment effects. We analyzed the code produced by the x86\_64 back-end: the OSR machinery is lowered into three native instructions that load a counter in a register, compare it against a constant value, and jump to the OSR block accordingly. The number of times the OSR condition is checked for each benchmark is the same as in the experiments reported in \mytable\ref{tab:sameFun}.

\paragraph{Overhead of OSR transitions.}
\mytable\ref{tab:sameFun} reports for each benchmark an estimate of the average cost of performing an OSR transition to a clone of the running function. For each benchmark we compute the difference in total CPU time when an always-firing or a never-firing OSR point is inserted in the code, respectively; we then normalize this difference by the number of fired OSR transitions. Hot code portions for OSR point insertion have been identified as in the experiments for code quality. However, as our goal for hot loops is to perform an OSR transition at each iteration, inserting an always-firing OSR point in the enclosing function is not an option: the function we OSR into would then have to fire an OSR itself, eventually leading to a very large number of active stack frames. Depending on the characteristics of the hot loop, we either transform it into a separate function and instrument its entry point (see the sketch below), or, when the loop calls a method with a high total self time, we insert an OSR point at the beginning of that method.
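As an informal illustration of both instrumentation shapes, the following C++ sketch shows the counter-based check that the back-end lowers into three native instructions, together with the loop-extraction strategy used for always-firing points. All identifiers ({\tt osr\_counter}, {\tt OSR\_THRESHOLD}, {\tt osr\_fire}) are placeholders of ours, not part of the library's interface.

\begin{verbatim}
/* Hedged sketch, not the library's actual output. */
static long osr_counter = 0;
enum { OSR_THRESHOLD = 100000 };

/* Hypothetical runtime hook: firing a transition essentially
   amounts to a function call passing the live variables. */
static void osr_fire(long i, double acc) {
  /* ...transfer control to the continuation function... */
  (void)i; (void)acc;
}

/* Never-firing (or rarely firing) variant: the inserted machinery
   is a counter load, a compare against a constant, and a jump. */
double hot_loop(long n) {
  double acc = 0.0;
  for (long i = 0; i < n; ++i) {
    if (++osr_counter >= OSR_THRESHOLD)
      osr_fire(i, acc);
    acc += 1.0 / (double)(i + 1);  /* original loop body */
  }
  return acc;
}

/* Always-firing variant: the loop body is outlined into a separate
   function whose entry point hosts the OSR point, so each iteration
   fires exactly one transition without recursively re-entering the
   instrumented frame. */
static void hot_loop_body(long i, double *acc) {
  osr_fire(i, *acc);               /* always-firing OSR point */
  *acc += 1.0 / (double)(i + 1);   /* body of iteration i */
}
\end{verbatim}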
Normalized differences reported in the table represent a reasonable estimate of the average cost of firing a single OSR transition, which is the cost of performing a function call passing the live variables as arguments. Reported numbers are in the order of nanoseconds, and might be negative due to the machine-level effects discussed above.
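In symbols, writing $T_{\rm always}$ and $T_{\rm never}$ for the total CPU times measured with an always-firing and a never-firing OSR point, respectively, and $N$ for the number of fired transitions, the per-transition estimate is computed as (notation introduced here for clarity):
\[
\Delta t_{\rm OSR} \;=\; \frac{T_{\rm always} - T_{\rm never}}{N}.
\]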

\paragraph{OSR machinery generation.}
We now discuss the overhead of the library for the insertion of OSR machinery in the IR of a function. \mytable\ref{tab:instrTime} reports for each benchmark the number of IR instructions in the instrumented function, the number of live values to transfer, and the time spent in IR manipulation. For open OSR points, we report the time spent in inserting the OSR point in the original function and in generating the stub; neither operation depends on the size of the function. For resolved OSR points, we report the time spent in inserting the OSR point and in generating the continuation function.
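To make the open/resolved distinction concrete, the following C++ sketch models the two flavors of OSR points; the types and names are ours for illustration and do not reflect the library's actual API. A resolved OSR point can transfer control directly to the continuation function generated at instrumentation time, whereas an open OSR point goes through a stub that builds the continuation at run time before completing the transition.

\begin{verbatim}
#include <functional>

/* Hedged conceptual sketch; illustrative types, not the
   library's API. */
struct LiveState { long i; double acc; };  /* live values */
using Continuation = void (*)(LiveState);

/* Resolved OSR point: the continuation function already exists,
   so firing is a direct call that passes the live state. */
void fireResolved(Continuation cont, LiveState s) { cont(s); }

/* Open OSR point: the target is unknown at instrumentation time;
   the stub materializes the continuation (e.g., by invoking a JIT
   compiler on the live state) and only then transfers control. */
struct OpenOSRStub {
  std::function<Continuation(LiveState)> buildContinuation;
  void fire(LiveState s) {
    Continuation cont = buildContinuation(s); /* runtime codegen */
    cont(s);                                  /* transition */
  }
};
\end{verbatim}

Under this reading, the times reported above match the two flavors: for open OSR points only the point insertion and the stub generation happen at instrumentation time, while the continuation function is produced when the stub fires.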