Camil Demetrescu  over 8 years ago

Commit id: e4b47ce67ccba7bb43264f961a8740bb8113b8fc

A previous study by Lameed and Hendren~\cite{lameed2013feval} shows that the overhead of an \feval\ call is significantly higher than that of a direct call, especially in JIT-based execution environments such as McVM~\cite{chevalier2010mcvm} and the proprietary MATLAB JIT accelerator by MathWorks. In fact, the presence of an \feval\ instruction can disrupt the results of intra- and inter-procedural type and array shape inference analyses, which are key factors for efficient code generation. Furthermore, since \feval\ invocations typically require a fallback to an interpreter, the parameters passed to an \feval\ are boxed to make them more generic. Our case study presents a novel technique for optimizing \feval\ in the McVM virtual machine, a complex research project developed at McGill University. McVM is publicly available~\cite{mcvm} and includes: a front-end for lowering MATLAB programs to an intermediate representation called IIR that captures the high-level features of the language; an interpreter for running MATLAB functions and scripts in IIR format; a manager component to perform analyses on IIR; a JIT compiler based on LLVM that generates native code for a function by lowering McVM IIR to LLVM IR; a set of helper components to perform fast vector and matrix operations using optimized libraries such as ATLAS, BLAS, and LAPACK. McVM implements a function versioning mechanism based on type specialization, which is the main driver for generating efficient code~\cite{chevalier2010mcvm}.
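To make the cost of boxing concrete, the following C++ sketch contrasts a generic, tag-dispatched addition on boxed values -- the kind of instruction a JIT must emit when types cannot be inferred -- with the unboxed addition a type-specialized version can use. It is a minimal illustration only: the type and function names are hypothetical and do not reflect McVM's actual runtime representation.

\begin{verbatim}
#include <stdexcept>

// Hypothetical boxed representation (not McVM's actual layout):
// a runtime type tag plus heap-allocated storage.
enum class Tag { Double, Matrix /* ... */ };
struct Boxed { Tag tag; void *payload; };

// Generic path: every operation must check tags, chase pointers,
// and allocate a fresh box for its result.
Boxed *addGeneric(const Boxed *a, const Boxed *b) {
  if (a->tag != Tag::Double || b->tag != Tag::Double)
    throw std::runtime_error("unsupported operand types");
  double r = *static_cast<double *>(a->payload)
           + *static_cast<double *>(b->payload);
  return new Boxed{Tag::Double, new double(r)};
}

// Specialized path: once types are inferred, the JIT can emit
// plain unboxed arithmetic with no dispatch and no allocation.
inline double addDouble(double a, double b) { return a + b; }
\end{verbatim}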

\paragraph{Discussion.}  The ideas presented in this section advance the state of the art of \feval\ optimization in MATLAB runtimes. As in OSR-based specialization, we place no restrictions on the functions that can be optimized. On the other hand, we work at the IIR (rather than IR) level as in JIT-based specialization, which allows us to perform type inference on the code with direct calls. Working at the IIR level eliminates the two main sources of inefficiency of OSR-based specialization: 1) we can replace generic instructions with specialized instructions, and 2) the types of $g$'s arguments do not need to be cached or guarded, as they are statically inferred. These observations are confirmed in practice by experiments on benchmarks from the MATLAB community, as we will show in \mysection\ref{ss:experim-results}.

For question Q4, we analyze the impact of the optimization technique presented in \mysection\ref{ss:eval-opt-mcvm} on the running time of a few numeric benchmarks, namely {\tt odeEuler}, {\tt odeMidpt}, {\tt odeRK4}, and {\tt sim\_anl}. The first three benchmarks~\cite{recktenwald2000numerical} solve an ordinary differential equation for heat treating simulation using the Euler, midpoint, and Runge-Kutta methods, respectively; the last benchmark minimizes the six-hump camelback function with the method of simulated annealing\footnote{\url{http://www.mathworks.com/matlabcentral/fileexchange/33109-simulated-annealing-optimization}}.

\paragraph{OSR Instrumentation in IR.}  To defer the compilation of the continuation function until the comparator is known at run time, we used \osrkit\ to instrument {\tt isord} with an open OSR point at the beginning of the loop body, as shown in \myfigure\ref{fig:isordfrom}. Portions added to the original code by the OSR instrumentation are highlighted in grey. New instructions are placed at the beginning of the loop body to increment a hotness counter {\tt p.osr} and jump to an OSR-firing block if the counter reaches the threshold (1000 iterations in this example). The OSR block contains a tail call to the target generation stub, which receives as parameters the four live variables at the OSR point ({\tt v}, {\tt n}, {\tt i}, {\tt c}). \osrkit\ allows the stub to receive the run-time value {\tt val} of an IR object that can be used to produce the continuation function -- in our example, the pointer to the comparator function to be inlined. The stub (see \myfigure\ref{fig:isordstub}) calls a code generator that: 1) builds an optimized version of {\tt isord} by inlining the comparator, and 2) uses it to create the continuation function {\tt isordto} shown in \myfigure\ref{fig:isordascto}. The stub passes four parameters to the code generator: 1) a pointer to the {\tt isord} IR code, 2) a pointer to the basic block in {\tt isord} from which the OSR is fired, 3) a user-defined object to support code generation in MCJIT, and 4) the stub's {\tt val} parameter (the first three are hard-wired by \osrkit). The stub terminates with a tail call to {\tt isordto}. To generate the continuation function from the optimized version created by the inliner, \osrkit\ replaces the function entry point, removes dead code, replaces live variables with the function parameters, and fixes $\phi$-nodes accordingly. Additions resulting from the IR instrumentation are in grey, while removals are struck-through.
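The C++ fragment below sketches this instrumentation pattern using the plain LLVM API, assuming a recent LLVM release (explicitly typed loads) and that the stub returns a value of the instrumented function's return type. \osrkit\ wraps this logic behind its own interface; identifiers such as {\tt insertOpenOSRPoint} and {\tt osr.fire} are illustrative, not part of \osrkit.

\begin{verbatim}
#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/GlobalVariable.h"
#include "llvm/IR/IRBuilder.h"

using namespace llvm;

// Instrument LoopBody with a hotness counter and an OSR-firing
// block, as in the isord example (threshold: 1000 iterations).
void insertOpenOSRPoint(Function &F, BasicBlock *LoopBody,
                        GlobalVariable *Counter, Function *Stub,
                        ArrayRef<Value *> LiveVars) {
  // Keep phi-nodes in place; move the original body into a fresh
  // block that the hotness check falls through to.
  BasicBlock *Cont =
      LoopBody->splitBasicBlock(LoopBody->getFirstNonPHI(), "body.cont");
  LoopBody->getTerminator()->eraseFromParent();

  // Increment the counter and test it against the threshold.
  IRBuilder<> B(LoopBody);
  Value *Old  = B.CreateLoad(B.getInt64Ty(), Counter, "p.osr");
  Value *New  = B.CreateAdd(Old, B.getInt64(1));
  B.CreateStore(New, Counter);
  Value *Fire = B.CreateICmpUGE(New, B.getInt64(1000));

  // OSR-firing block: tail-call the target-generation stub on the
  // live variables and return its result (we assume the stub's
  // return type matches F's).
  BasicBlock *OsrBB = BasicBlock::Create(F.getContext(), "osr.fire", &F);
  B.CreateCondBr(Fire, OsrBB, Cont);

  IRBuilder<> OB(OsrBB);
  CallInst *CI = OB.CreateCall(Stub, LiveVars);
  CI->setTailCall();
  OB.CreateRet(CI);
}
\end{verbatim}

Once the OSR fires, the stub tail-calls the continuation function it has generated, so the instrumented {\tt isord} never returns to the unspecialized loop.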

The two approaches differ in the points where code specialization is performed. In JIT-based specialization, $f'$ is generated when $f$ is called. In contrast, the OSR-based method interrupts $f$ as it executes, generates a specialized version $f'$, and resumes from it. Another technical difference, which has substantial performance implications, is the representation level at which optimization occurs in the two approaches. When a function $f$ is first compiled from MATLAB to IIR, and then from IIR to IR, the functions it calls via \feval\ are unknown and the type inference engine is unable to infer the types of their returned values. Hence, these values must be kept boxed in heap-allocated objects and handled with slow generic instructions in the IR representation of $f$ (suitable for handling different types). The JIT method works on the IIR representation of $f$ and can resort to the full power of type analysis to infer the types of the values returned by $g$, turning the slow generic instructions of $f$ into fast type-specialized instructions in $f'$. OSR-based specialization, on the other hand, operates on the IR representation of $f$, which prevents the optimizer from exploiting type inference. As a consequence, for $f'$ to be sound, the direct call to $g$ must be guarded by a condition that checks that $g$ and the types of its parameters remain the same as observed when $f$ was interrupted. If the guard fails, the code falls back to executing the original \feval\ instruction. JIT-based specialization is less general than OSR-based specialization, as it only works if the \feval\ argument $g$ is one of the parameters of $f$, but it is substantially faster due to the benefits of type inference.
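The guard can be pictured with the following C++ sketch, in which the specialized code in $f'$ dispatches between the direct call and the generic fallback. All type and function names are hypothetical stand-ins, not McVM's actual runtime interface.

\begin{verbatim}
// Snapshot of what was observed when f was interrupted: the feval
// target and the (tag of the) argument type. Hypothetical names.
struct Value   { int typeTag; /* boxed payload elided */ };
struct Closure { void *code; };
struct GuardState { Closure *fun; int argTag; };

// Stand-ins for the two execution paths.
Value *g_specialized(Value *arg) { return arg; }  // direct, inlined g
Value *interpFeval(Closure *, Value *arg) { return arg; }  // generic feval

// Guarded dispatch emitted in f': take the fast path only while the
// target and argument type match the snapshot; otherwise fall back
// to the original feval instruction.
Value *fevalGuarded(Closure *target, Value *arg, const GuardState &gs) {
  if (target == gs.fun && arg->typeTag == gs.argTag)
    return g_specialized(arg);
  return interpFeval(target, arg);
}
\end{verbatim}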