%\item {\bf Transformations: }
%\item {\bf Binary: }
%\item {\bf Data set: }
\item {\bf Run-time environment: } Linux (version 3.x), no root password required.
\item {\bf Hardware: } x86-64 CPU.
\item {\bf Run-time state: } Cache-sensitive (performance measurements only).
%\item {\bf Execution: }
\item {\bf Output: } Measurements are output to the console.
\item {\bf Experiment workflow: } Invoke scripts and perform a few manual steps.
\item {\bf Publicly available?} Yes.
\end{itemize}

\noindent A $\phi$-node {\tt \%i.01} is used to represent the index of the {\tt for} loop from the C code, and is set to {\tt \%10} when reached from the loop header (basic block {\tt \%2}) {\em after} a loop iteration. In fact, as a result of {\small \tt -O1} optimizations, when {\tt n>1} execution jumps from the function entrypoint {\tt \%0} directly into the loop body, initializing the $\phi$-node with {\tt 1}. Comparator {\tt c} is invoked with a tail call, storing its return value into virtual register {\tt \%8}.

OSR points can be inserted with the {\tt INSERT\_OSR} command, which allows several combinations of features (see {\tt HELP} for an exhaustive list). In this session we will modify {\tt isord} so that when the loop body is entered for the first time, an OSR is fired right away:

\begin{small}
\begin{verbatim}

\end{verbatim}
\end{small}

\noindent The method returns $1$ as result, indicating that the vector is sorted. Compared to \myfigure\ref{fig:isordascto}, the IR code generated for the OSR continuation function {\tt isordto} ({\tt DUMP isordto}) is slightly different, as the MCJIT compiler detects that additional optimizations (e.g., loop strength reduction) are possible and performs them right away\footnote{Notice that lowering an IR function to native code in MCJIT (which happens in \tinyvm\ when the function is first executed) may alter its IR representation.}. We expect the code generated for {\tt isord\_stub} to be identical, up to renaming, to the IR reported in \myfigure\ref{fig:isordstub}. To show the native code generated by the MCJIT back-end, we can run \tinyvm\ in a debugger with {\small\tt gdb tinyvm} and leverage the debugging interface of MCJIT. For instance, once {\tt driver} has been invoked, we can switch to the debugger with {\tt CTRL-Z} and display the x86-64 code for any JIT-compiled method with:

\begin{small}
\begin{verbatim}
(gdb) disas isordto

<+21>: cmp %rsi,%rcx
<+24>: jl 0x7ffff7ff2000
<+26>: retq
End of assembler dump.
\end{verbatim}
\end{small}

\noindent To return to \tinyvm, we can use the {\tt signal 0} command in {\tt gdb} (the prompt is not re-printed, but the shell is alive).
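For reference, the end-to-end debugging workflow might look as follows (a sketch: aside from the commands discussed above, the session steps are illustrative):

\begin{small}
\begin{verbatim}
$ gdb tinyvm
(gdb) run
[TinyVM session: LOAD_IR, INSERT_OSR, driver(...)]
^Z                   # suspend TinyVM: back to the gdb prompt
(gdb) disas isordto  # x86-64 code for a JIT-compiled method
(gdb) signal 0       # resume TinyVM, delivering no signal
\end{verbatim}
\end{small}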

\item {\tt finalAlwaysFire} and {\tt finalAlwaysFire-O1}: IR code of the benchmark preprocessed by slicing the hottest loop into a separate function when needed (see \ref{ss:experim-results}).
\end{itemize}

\noindent Each experiment runs a warm-up phase followed by 10 identical trials. We manually collected the figures from the console output and analyzed them, computing confidence intervals. We show how to run the code using {\tt n-body} as an example\footnote{For {\tt rev-comp}, first run {\tt bootstrap.sh} in {\tt tinyvm/shootout/}.}. Times reported in this section have been measured in VirtualBox on an Intel Core i7-4980HQ CPU @ 2.80GHz, a different setup than the one discussed in \ref{ss:bench-setup}.

\paragraph{Question Q1.} The purpose of this experiment is to assess the impact of the presence of OSR points on code quality. The first step consists in generating figures for the baseline (uninstrumented) benchmark version. Go to {\small\tt /home/osrkit/Desktop/tinyvm} and type:

LOAD_IR shootout/n-body/bench.ll
bench(50000000)
REPEAT 10 bench(50000000)
QUIT
\end{verbatim}
\end{small}
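Since figures are collected manually from the console, the mean and a 95\% confidence interval over the 10 trials can be computed with a few lines of Python (a minimal sketch; the times below are placeholders to be replaced with the measured values):

\begin{small}
\begin{verbatim}
import math

# Placeholder per-trial times (seconds) from the console output
times = [5.67, 5.68, 5.67, 5.69, 5.66,
         5.68, 5.67, 5.68, 5.66, 5.67]

n = len(times)
mean = sum(times) / n
# sample standard deviation
std = math.sqrt(sum((t - mean)**2 for t in times) / (n - 1))
# 95% CI half-width: Student's t, 9 degrees of freedom
ci = 2.262 * std / math.sqrt(n)
print(f"mean = {mean:.3f}s, 95% CI = +/- {ci:.3f}s")
\end{verbatim}
\end{small}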

INSERT_OSR 5 NEVER OPEN UPDATE IN bench AT %8 CLONE
bench(50000000)
REPEAT 10 bench(50000000)
QUIT
\end{verbatim}
\end{small}

\noindent Note that the second line inserts a never-firing open OSR point at the basic block labeled with {\tt \%8} in function {\tt bench} of file {\tt shootout/n-body/bench.ll}, using a branch weight of 5\% as a hint to the LLVM native code generation back-end that OSR firing is very unlikely. The experiment duration was $\approx1$m, with a time per trial of $\approx5.673$s. The ratio $5.673/5.725=0.990$ for {\tt n-body} is slightly smaller than the one reported in \myfigure\ref{fig:code-quality-base} on the Intel Xeon platform. The experiment for building \myfigure\ref{fig:code-quality-O1} uses the scripts in {\tt bench-O1} and {\tt codeQuality-O1}.

\paragraph{Question Q2.} This experiment assesses the run-time overhead of an OSR transition by measuring the duration of an always-firing OSR execution and of a never-firing OSR execution, and reporting the difference averaged over the number of fired OSRs (\mytable\ref{tab:sameFun}); the computation is spelled out in the formula reported after the timing listings below. The always-firing OSR execution for {\tt n-body} (unoptimized) is as follows:

\begin{small}
\begin{verbatim}
$ tinyvm shootout/scripts/finalAlwaysFire/n-body

%entry TO advance AT %entry AS advance_OSR
bench(50000000)
REPEAT 10 bench(50000000)
QUIT
\end{verbatim}
\end{small}

\noindent The second line inserts an always-firing resolved OSR point at the beginning of basic block {\tt \%entry} in function {\tt advance} of file {\small\tt shootout/n-body/finalAlwaysFire.ll}, generating a continuation function called {\tt advance\_OSR}. A branch weight of 95\% is given as a hint to the LLVM native code generation back-end that OSR firing is a high-probability event. The time per trial was $\approx5.876$s. The never-firing OSR execution used as baseline is as follows:

\begin{small}

\begin{verbatim}
Time spent in stub generation: 0.000012835 sec
Time spent in OSR point insertion: 0.000013219 sec
Time spent in IR verification: 0.000060297 sec
\end{verbatim}
\end{small}

\begin{verbatim}
Time spent in creating cont. func.: 0.000075849 sec
Time spent in OSR point insert.: 0.000009409 sec
Time spent in IR verification: 0.000069923 sec
\end{verbatim}
\end{small}
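Going back to Question Q2, the per-transition overhead reported in \mytable\ref{tab:sameFun} is obtained as follows (a sketch of the computation, not \tinyvm\ output; $N$ denotes the number of OSRs fired during the always-firing execution):

\[
\mathit{overhead}_{\mathrm{OSR}} = \frac{T_{\mathrm{always}} - T_{\mathrm{never}}}{N}
\]

\noindent where $T_{\mathrm{always}}$ and $T_{\mathrm{never}}$ are the average per-trial times of the always-firing and never-firing executions, respectively (e.g., $T_{\mathrm{always}}\approx5.876$s for {\tt n-body} above).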

McVM is a virtual machine for MATLAB developed at McGill University. As a by-product of our project, we ported it from the LLVM legacy JIT to MCJIT, and later extended it with a new specialization mechanism for {\tt feval} calls. The source code for this version, along with the MATLAB benchmarks listed in \mysection\ref{ss:bench-setup}, is publicly available at \url{https://github.com/dcdelia/mcvm}. Experiments reported in \mytable\ref{tab:feval} (Question Q4) can be repeated using a number of scripts provided along with a McVM build in {\small\tt /home/osrkit/Desktop/mcvm/}.
%Prerequisites for McVM compilation are header files for a number of scientific libraries (ATLAS, BLAS, and LAPACKE) and the Boehm garbage collector, which can be built automatically using the script {\tt bootstrap.sh} provided in the repository.
For each benchmark {\tt X}, {\small\tt benchmarks/scripts/} contains three MATLAB scripts to use as input for {\tt mcvm}:

McVM - The McLab Virtual Machine v1.0
Visit http://www.sable.mcgill.ca for more info.
***********************************************
>: Compiling function: "testSH"
Compiling function: "odeRK4"
Compiling function: "testSHfun"

\end{verbatim}
\end{small}

\noindent The experiment duration on our platform was $\approx2$m, with an average time per trial of $\approx19.836$s (manually computed by averaging the elapsed time figures from the console, after discarding the warm-up run). The resulting speedup for the base code caching mechanism was thus $20.142/19.836=1.015\times$, slightly different from the one reported in column {\em Base} of \mytable\ref{tab:feval} for the Intel Xeon platform, for which we repeated each experiment $10$ times. We can now set an upper bound for speedups by measuring the running time when the code has been optimized by hand, inserting direct calls in place of {\tt feval} instructions:

\end{verbatim}
\end{small}

\noindent In this scenario McVM can compile the whole program ahead of time, as {\tt rhsSteelHeat} is not invoked through an {\tt feval} instruction anymore. A comparison of the running times suggests a rough $20.142/7.977=2.525\times$ speedup for by-hand optimization w.r.t.\ the baseline version (compare to column {\em Direct} in \mytable\ref{tab:feval}). We can now try to assess the speedup from our {\tt feval} optimization technique on {\tt odeRK4}:

\end{verbatim}
\end{small}

\noindent The execution time ratio between the base version and the optimized code that we JIT-compile is thus $20.142/8.451=2.383\times$ (compare to column {\em Opt. JIT} in \mytable\ref{tab:feval}). Notice that compensation code is generated to perform unboxing of IIR variables {\tt y} and {\tt \$t10} (``Type conversion required...'') so that execution can correctly resume from the optimized code. We can finally evaluate the speedup enabled by our code caching mechanism (\mysection\ref{ss:eval-opt-mcvm}) for the compilation of continuation functions by running:

\begin{small}
\begin{verbatim}
$ ./mcvm -jit_feval_opt true <

\end{verbatim}
\end{small}

\noindent The experiment duration was $\approx1$m, with a time per trial of $\approx11.817$s (discarding the warm-up run). The resulting speedup is thus $20.142/8.006=2.516\times$ (compare to column {\em Opt. cached} in \mytable\ref{tab:feval}).
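For convenience, the speedup figures quoted in this section can be reproduced with a few lines of Python (a minimal sketch using the per-trial times measured in our VirtualBox session):

\begin{small}
\begin{verbatim}
# Average per-trial times (seconds) from the sessions above
base      = 20.142   # feval baseline, no optimizations
cached    = 19.836   # base code caching mechanism
direct    = 7.977    # by-hand direct calls (upper bound)
opt_jit   = 8.451    # feval optimization, JIT continuation
opt_cache = 8.006    # feval optimization + code caching

for name, t in [("Base", cached), ("Direct", direct),
                ("Opt. JIT", opt_jit),
                ("Opt. cached", opt_cache)]:
    print(f"{name:12} speedup: {base / t:.3f}x")
\end{verbatim}
\end{small}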