
\end{description}

\subsection{Benchmarks and Setup}

We address questions Q1-Q3 by analyzing the performance of \osrkit\ on a selection of the \shootout\ benchmarks~\cite{shootout} running in a proof-of-concept virtual machine we developed in LLVM. In particular, we focus on single-threaded benchmarks that do not rely on external libraries to perform their core computations. Benchmarks and their descriptions are reported in \mytable\ref{tab:shootout}; four of them ({\tt b-trees}, {\tt mbrot}, {\tt n-body} and {\tt sp-norm}) are evaluated against two workloads of different size.

We generate the IR modules for our experiments with {\tt clang}, starting from the C version of the \shootout\ suite. To cover scenarios where OSR machinery is inserted into programs with different optimization levels, we consider two versions: 1) {\em unoptimized}, where the only LLVM optimization we perform is {\tt mem2reg}, which promotes stack references to registers and constructs the SSA form; 2) {\em optimized}, where we apply {\tt opt} {\tt -O1} to the unoptimized version (a programmatic sketch of this preparation step is shown below).

For question Q4, we analyze the impact of the optimization technique presented in \mysection\ref{ss:eval-opt-mcvm} on the running time of a few numeric benchmarks, namely {\tt odeEuler}, {\tt odeMidpt}, {\tt odeRK4}, and {\tt sim\_anl}. The first three benchmarks~\cite{recktenwald2000numerical} solve an ODE for heat treating simulation using the Euler, midpoint, and Runge-Kutta method, respectively; the last benchmark minimizes the six-hump camelback function with the method of simulated annealing\footnote{\url{http://www.mathworks.com/matlabcentral/fileexchange/33109-simulated-annealing-optimization}}.
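A minimal programmatic sketch of how the {\em unoptimized} version can be prepared is shown below: it loads a {\tt clang}-emitted module and schedules {\tt mem2reg} through the LLVM legacy pass manager. In our experiments this step is driven with the command-line {\tt opt} tool; the fragment is only an equivalent illustration, and the helper name {\tt prepareUnoptimizedIR} is not part of \osrkit.

\begin{verbatim}
// Sketch: obtain the "unoptimized" IR (mem2reg only) from a module
// emitted by clang; the "optimized" version is then produced by
// running the -O1 pipeline on the result.
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/IRReader/IRReader.h"
#include "llvm/Support/SourceMgr.h"
#include "llvm/Transforms/Scalar.h" // createPromoteMemoryToRegisterPass
#include <memory>
#include <string>

std::unique_ptr<llvm::Module>
prepareUnoptimizedIR(llvm::LLVMContext &Ctx, const std::string &Path) {
  llvm::SMDiagnostic Err;
  std::unique_ptr<llvm::Module> M = llvm::parseIRFile(Path, Err, Ctx);
  if (!M) return nullptr;

  // Promote stack slots to SSA registers, as "opt -mem2reg" does.
  llvm::legacy::FunctionPassManager FPM(M.get());
  FPM.add(llvm::createPromoteMemoryToRegisterPass());
  FPM.doInitialization();
  for (llvm::Function &F : *M)
    if (!F.isDeclaration()) FPM.run(F);
  FPM.doFinalization();
  return M;
}
\end{verbatim}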

\subsection{Results}
\label{ss:experim-results}

We now present the main results of our experimental evaluation. For each benchmark we compute: 1) the average time per OSR transition; 2) the number of transferred live variables; 3) the total benchmark time with an always-firing OSR point at each iteration of the hottest loop; 4) the total benchmark time with a never-firing OSR point at each iteration of the hottest loop (baseline); 5) the number of iterations of the hottest loop (which equals the number of OSR transitions).
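In terms of the quantities above, the estimate of the average cost per OSR transition discussed for Q2 is obtained by normalizing the difference between the always-firing and the never-firing runs by the number of fired transitions; using symbols introduced here only for illustration:
\[
\bar{t}_{\mathit{OSR}} = \frac{T_{\mathit{always}} - T_{\mathit{never}}}{N_{\mathit{fired}}}
\]
where $T_{\mathit{always}}$ and $T_{\mathit{never}}$ are the total benchmark times from items 3) and 4), and $N_{\mathit{fired}}$ is the number of loop iterations from item 5).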

\end{table*}
\ifauthorea{\newline}{}
For iterative benchmarks, we insert an OSR point in the body of their hottest loops. We classify a loop as hottest when its body is executed for a very high cumulative number of iterations (e.g., from a few millions up to billions) and it either calls the method with the highest {\em self} time in the program, or it performs the most computationally intensive operations of the program in its own body.
\ifdefined \fullver
These loops are natural candidates for OSR point insertion: for instance, the Jikes RVM inserts yield points on backward branches to trigger operations such as method recompilation through OSR and thread preemption for garbage collection. In the \shootout\ benchmarks, the number of such loops is typically 1 (2 for {\tt spectral-norm}).
\else

Results for the unoptimized and optimized versions of the benchmarks are reported in \myfigure\ref{fig:code-quality-base} and \myfigure\ref{fig:code-quality-O1}, respectively. For both scenarios we observe that the overhead is very small, i.e., less than $1\%$ for most benchmarks and less than $2\%$ in the worst case. For some benchmarks, code might even run slightly faster after OSR point insertion due to instruction cache effects. We analyzed the code produced by the x86-64 back-end: the OSR machinery is lowered into three native instructions that load a counter into a register, compare it against a constant value, and jump to the OSR block accordingly. The number of times the OSR condition is checked for each benchmark is the same as in the experiments reported in \mytable\ref{tab:sameFun}.

\paragraph{Q2: Overhead of OSR Transitions.}
\mytable\ref{tab:sameFun} reports estimates of the average cost of performing an OSR transition to a clone of the running function. For each benchmark we compute the time difference between the scenarios in which an always-firing and a never-firing resolved OSR point is inserted in the code, respectively; we then normalize this difference against the number of fired OSR transitions. Hot code portions for OSR point insertion have been identified as in Q1.
\ifdefined\fullver
However, since for hot loops we want to perform an OSR transition at each iteration, inserting an always-firing OSR point in the enclosing function is not an option: the function we OSR into would then fire an OSR itself, eventually leading to a very large number of active stack frames.
\fi
Depending on the characteristics of the hot loop, we either transform its body into a separate function and instrument its entrypoint, or, when the loop calls a method with a high {\em self} time, we insert an OSR point at the beginning of that method. Normalized differences reported in the table represent a reasonable estimate of the average cost of firing a single OSR transition, which is essentially the cost of performing a function call that passes the live variables as arguments (a source-level sketch of this shape is given below, after \mytable\ref{tab:instrTime}). Reported numbers are in the order of nanoseconds, and might be negative due to instruction cache effects.

\paragraph{Q3: OSR Machinery Generation.}
We now discuss the overhead of the \osrkit\ library for inserting OSR machinery in the IR of a function. \mytable\ref{tab:instrTime} reports for each benchmark the number of IR instructions in the instrumented function, the number of live values to transfer, and the time spent in the IR manipulation. Locations for OSR points are chosen as in Q1, and the target function is a clone of the source function.

\begin{table}[hb]
\begin{small}
\begin{tabular}{ |c|c|c|c|c|c|c| }
\cline{3-7}
\multicolumn{2}{l|}{} & \multicolumn{2}{c|}{{\em Open OSR {\tiny$(\mu s)$}}} & \multicolumn{3}{c|}{{\em Resolved OSR {\tiny$(\mu s)$}}} \\
\cline{3-7}
\multicolumn{2}{l|}{} & Insert & Gen. & Insert & \multicolumn{2}{|c|}{Generate \fosrto} \\
\cline{1-2} \cline{6-7}

sp-norm & 28 & 15.31 & 27.50 & 12.41 & 154.54 & 5.52 \\
\hline
\end{tabular}
\caption{\label{tab:instrTime} Q3: OSR machinery insertion in optimized code. Time measurements are expressed in microseconds. Results for unoptimized code are very similar and thus not reported.}
\end{small}
\end{table}
\ifauthorea{\newline}{}
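To make the shape of the instrumented code concrete, the following source-level sketch mimics, on a C++ toy kernel, what \osrkit\ produces at the IR level for a resolved OSR point: the guard lowers to a load, a compare against a constant, and a conditional jump, while the transition itself is an ordinary call that passes the live variables to a continuation clone. Function names, the loop body, and the threshold are purely illustrative; the actual transformation is performed by \osrkit\ on LLVM IR, not on source code.

\begin{verbatim}
#include <cstdint>

// Threshold for the OSR condition: out of reach in the "never-firing"
// configuration, 0 in the "always-firing" one (illustrative values).
static const int64_t OSR_THRESHOLD = INT64_MAX;

// Continuation clone: resumes the computation from the iteration at
// which the OSR point fired, taking the live variables as arguments.
static double hot_loop_cont(int64_t i, double acc, int64_t n) {
  for (; i < n; ++i)
    acc += 1.0 / static_cast<double>(i + 1); // same body as below
  return acc;
}

double hot_loop(int64_t n) {
  double acc = 0.0;
  for (int64_t i = 0; i < n; ++i) {
    // OSR point: load the counter, compare it against a constant,
    // and jump to the OSR block (the only overhead measured in Q1).
    if (i >= OSR_THRESHOLD)
      return hot_loop_cont(i, acc, n); // OSR transition: a call passing
                                       // the live variables (Q2 cost)
    acc += 1.0 / static_cast<double>(i + 1);
  }
  return acc;
}
\end{verbatim}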

\paragraph{Discussion.}
Experimental results presented in this section suggest that inserting an OSR point is a quick operation and is unlikely to degrade the quality of generated code (Q1). The time required to fire an OSR transition is negligible (i.e., in the order of nanoseconds, Q2), while the cost of OSR point insertion and of generating a continuation function (either when inserting a resolved OSR point, or from the callback method invoked at an open OSR transition) is likely to be dominated by the cost of its compilation (Q3). For a front-end, the choice whether to insert an OSR point into a function for dynamic optimization merely depends on the trade-off between the expected benefits in terms of execution time and the overheads from generating and JIT-compiling an optimized version of the function; compared to these two operations, the cost of OSR-related operations is negligible.

\paragraph{Q4: Optimizing {\tt feval} in MATLAB.}

sim\_anl & 1.009 & 1.564 & 1.606 & 1.612 \\
\hline
\end{tabular}
\caption{\label{tab:feval} Q4: Speedup comparison for \feval\ optimization.}
\end{small}
\end{table}

\label{fig:code-quality-O1} Q1: Impact on running time of never-firing OSR points inserted inside hot code portions (optimized code).

\label{fig:code-quality-base} Q1: Impact on running time of never-firing OSR points inserted inside hot code portions (unoptimized code).