
\item An optimizer triggered at OSR points, which uses: \fi \ifdefined \noauthorea \begin{enumerate}[noitemsep] \else \begin{enumerate} \fi

\newcommand{\gOptIR}{$g^{IR}_{opt}$}
\subsection{Generating Optimized Code}
\label{sse:optimizing_feval}
The core of our optimization pipeline is the optimizer module, which is responsible for generating optimized code for the running function \gBase\ using contextual information passed by an open-OSR stub. As a first step, the optimizer inspects {\tt val} to resolve the target $f$ of the \feval\ and checks whether a previously compiled version of \gBase\ optimized for $f$ is available from the cache. \ifdefined \fullver If not, a new function \gOpt\ is generated by cloning the IIR representation \gIIR\ of \gBase\ into \gOptIIR\ and replacing all the \feval\ calls in the same group as the instrumented one with direct calls to $f$.
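To make this step concrete, the following sketch outlines the control flow of a hypothetical optimizer callback; all names ({\tt resolveFevalTarget}, {\tt cloneIIR}, {\tt redirectFevalGroup}, {\tt jitCompile}) are illustrative stand-ins and do not correspond to the actual McVM or \osrkit\ interfaces.

\begin{verbatim}
// Hypothetical sketch of the optimizer invoked at an open OSR point.
// Names and types are illustrative, not the actual McVM/OSRKit API.
#include <map>
#include <utility>

struct IIRFunction;   // IIR representation of a function (opaque here)
struct NativeCode;    // handle to JIT-compiled machine code (opaque here)

// Stand-ins for services provided by the VM; bodies omitted on purpose.
IIRFunction* resolveFevalTarget(void* val);            // inspect val to find f
IIRFunction* cloneIIR(IIRFunction* gBase);             // clone g_IIR into g_IIR_opt
void redirectFevalGroup(IIRFunction* gOpt, IIRFunction* f); // feval -> direct calls
NativeCode*  jitCompile(IIRFunction* gOpt);

// Cache of optimized versions, keyed by (base function, feval target).
static std::map<std::pair<IIRFunction*, IIRFunction*>, NativeCode*> cache;

NativeCode* optimizeAtOpenOSR(IIRFunction* gBase, void* val) {
    IIRFunction* f = resolveFevalTarget(val);   // 1. resolve the feval target
    auto key = std::make_pair(gBase, f);
    auto it  = cache.find(key);
    if (it != cache.end()) return it->second;   // 2. reuse a cached version
    IIRFunction* gOpt = cloneIIR(gBase);        // 3. clone the IIR of gBase
    redirectFevalGroup(gOpt, f);                // 4. replace the feval group
    NativeCode* code = jitCompile(gOpt);        // 5. compile the optimized clone
    cache[key] = code;
    return code;
}
\end{verbatim}

The cache lookup is keyed by the pair (\gBase, $f$), reflecting the reuse of previously compiled specialized versions mentioned above.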

\section{Experimental Evaluation}
\label{se:experiments}
In this section we present a preliminary experimental study of \osrkit\ aimed at addressing the following questions:
\begin{description}
\item[Q1] How much does a never-firing OSR point impact code quality? What kind of slowdown should we expect?
\item[Q2] What is the run-time overhead of an OSR transition, for instance to a clone of the running function?
\item[Q3] What is the overhead of \osrkit\ for inserting OSR points and creating a stub or a continuation function?
\item[Q4] What kind of benefits can we expect by using OSR in a production environment based on LLVM?
\end{description}

\subsection{Benchmarks and Setup}
We address questions Q1-Q3 by analyzing the performance of \osrkit\ on a selection of the \shootout\ benchmarks~\cite{shootout} running in a proof-of-concept virtual machine. In particular, we focus on single-threaded benchmarks that do not rely on external libraries to perform their core computations. Benchmarks and their description are reported in \mytable\ref{tab:shootout}; four of them ({\tt b-trees}, {\tt mbrot}, {\tt n-body}, and {\tt sp-norm}) are evaluated against two workloads of different size.
\begin{table}
\begin{center}

\caption{\label{tab:shootout} Description of the \shootout\ benchmarks.}
\end{table}

We generated the IR modules for our experiments with {\tt clang} starting from the C version of the \shootout\ suite. In the version of the code we will refer to as {\em unoptimized}, the only LLVM optimization pass we performed is {\tt mem2reg}, which promotes memory references to register references and constructs the SSA (Static Single Assignment) form.
\ifdefined \fullver
Starting from this version, we also generate an {\em optimized} version using the LLVM IR optimizer {\tt opt} at the {\tt -O1} optimization level.
\else
Starting from this version, we also generate an {\em optimized} version using the LLVM optimizer {\tt opt -O1}.
\fi

For question Q4, we analyze the impact of the optimization technique presented in \mysection\ref{sse:optimizing_feval} on the running time of a few numeric benchmarks, namely {\tt odeEuler}, {\tt odeMidpt}, {\tt odeRK4}, and {\tt sim\_anl}. The first three benchmarks solve an ODE for heat treating simulation using the Euler, midpoint, and Runge-Kutta method, respectively\footnote{\url{http://web.cecs.pdx.edu/~gerry/nmm/mfiles/}}; the last benchmark minimizes the six-hump camelback function with the method of simulated annealing\footnote{\url{http://www.mathworks.com/matlabcentral/fileexchange/33109-simulated-annealing-optimization}}.

All the experiments were performed on an octa-core 2.3 GHz Intel Xeon E5-4610 v2 with 256+256KB of L1 cache, 2MB of L2 cache, 16MB of shared L3 cache and 128 GB of DDR3 main memory, running Debian Wheezy 7, Linux kernel 3.2.0,
\ifdefined \fullver
LLVM 3.6.2 (Release build, compiled using gcc 4.7.2), 64 bit.
\else

\subsection{Results}
We now present the main results of our experimental evaluation, addressing questions Q1-Q4 in turn.

Experimental results presented in this section suggest that inserting an OSR point is a quick operation and is unlikely to degrade the quality of generated code. The time required to fire an OSR transition is negligible (i.e., on the order of nanoseconds), while the cost of generating a continuation function, either when inserting a resolved OSR point or from the callback method invoked at an open OSR transition, is likely to be dominated by the cost of its compilation. For a front-end, the choice of whether to insert an OSR point into a function for dynamic optimization merely depends on the trade-off between the expected benefits in terms of execution time and the overheads from generating and JIT-compiling an optimized version of the function; compared to these two operations, the cost of OSR-related operations is negligible.

\paragraph{Optimizing {\tt feval} in MATLAB.}
We conclude this section with a discussion of the effectiveness of our optimization technique for McVM. We report the speed-ups enabled by our technique in \mytable\ref{tab:feval}, using the running times for McVM's \feval\ default dispatcher as baseline. As the dispatcher typically JIT-compiles the invoked function, we also analyzed running times when the dispatcher calls a previously compiled function. In the last column, we show speed-ups from a modified version of the benchmarks in which each \feval\ call is replaced by hand with a direct call to the function in use for the specific benchmark.
\begin{table}[hb]
\begin{small}

In this section we discuss one possible embodiment of the OSR approach of \mysection\ref{se:overview} in LLVM. Our discussion is based on a simple running example that illustrates a profile-driven optimization scenario. We start from a simple base function ({\tt isord}) that checks whether an array of numbers is ordered according to some criterion specified by a comparator (see \myfigure\ref{fi:isord-example}). Our goal is to instrument {\tt isord} so that, whenever the number of loop iterations exceeds a certain threshold, control is dynamically diverted to a faster version generated on the fly by inlining the comparator.

The IR code shown in this section\footnote{Virtual register names and labels in the LLVM-produced IR code shown in this paper have been refactored to make the code more readable.} has been generated with \clang\ and instrumented with \osrkit, a library we prototyped to help VM builders deploy OSR in LLVM. \osrkit\ provides a number of useful abstractions that include open and resolved OSR instrumentation of IR base functions without breaking the SSA (Static Single Assignment) form, liveness analysis, generation of OSR continuation functions, and mapping of LLVM values between different versions of a program along with compensation code generation\footnote{An accompanying artifact will allow the interested reader to get acquainted with \osrkit\ and repeat the sample scenario described in this section.}.
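As a quick reference, a C-style sketch along the lines of {\tt isord} is shown below; the exact signature and types are illustrative assumptions, and the authoritative version is the one in \myfigure\ref{fi:isord-example}.

\begin{verbatim}
/* Hypothetical sketch of the isord running example: names and types are
   illustrative and may differ from the code in the paper's figure. */
typedef long (*cmp_t)(long a, long b); /* comparator: > 0 if a, b out of order */

long isord(long* v, long n, cmp_t c) {
    /* Hot loop: a natural location for an OSR point. Once the iteration
       count exceeds a threshold, execution can be diverted to a version
       of isord specialized by inlining the comparator c. */
    for (long i = 1; i < n; i++) {
        if (c(v[i - 1], v[i]) > 0)     /* indirect call through the comparator */
            return 0;
    }
    return 1;                          /* the first n elements are ordered */
}
\end{verbatim}

In this sketch, the live state at an OSR point placed in the loop would consist of {\tt v}, {\tt n}, {\tt c}, and the loop index, which is the information to be transferred to the continuation function when the transition fires.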