dcdelia Merge branch 'master' of https://github.com/camild/article-llvm-osr  over 8 years ago

Commit id: 178c52466c2126e9bc8b23b4ca2ecf3546dc6d70

deletions | additions      

       

A previous study by Lameed and Hendren~\cite{lameed2013feval} shows that the overhead of an \feval\ call is significantly higher than a direct call, especially in JIT-based execution environments such as McVM~\cite{chevalier2010mcvm} and the proprietary MATLAB JIT accelerator by Mathworks. In fact, the presence of an \feval\ instruction can disrupt the results of intra- and inter-procedural level for type and array shape inference analyses, which are key factors for efficient code generation. Furthermore, since \feval\ invocations typically require a fallback to an intepreter, parameters passed to an \feval\ are generally boxed to make them more generic.  Lameed and Hendren proposed two dynamic techniques for optimizing \feval\ instructions in McVM: {\em OSR-based} and {\em JIT-based} specialization, which work as follows.  \paragraph{OSR-based specialization.} When a loop containing an \feval\ becomes hot in a function $f$, an OSR is triggered. At that time, a specialized version $f'$ of $f$ is generated and control is diverted to $f'$. $f'$ is generated by an optimizer that attempts to replace an \feval$(g,x,y,z...)$ instruction with a direct call $g(x,y,z,...)$ with unboxed parameters, leveraging the fact that the actual values of $g$ and its arguments are known at OSR time. The direct call is guarded by a condition that checks if $g$ and the type of its parameters remain the same as observed when the OSR was fired. If the guard fails, the code falls back to executing the original \feval\ instruction.  \paragraph{JIT-based specialization.} [...]  \noindent Although the OSR-based approach is more general, it generates much less efficient code compared to the JIT-based one for two main reasons:   \begin{enumerate}  \item when a function $f$ is first compiled from MATLAB to IR by McVM, the functions it calls via \feval\ are unknown. Hence, the type inference engine is unable to infer types for their returned values, and these values must be kept boxed in heap-allocated objects and handled with slow generic instructions in the IR representation of $f$ (suitable for handling different types). For this reason, the optimized continuation functions $f'$ generated at OSR points in McVM inherit the slow generic instructions of $f$.  \item guard computation in $f'$ can be rather expensive, as it may require checking many parameters;  \end{enumerate}  \ifdefined\fullver  The first one is based on OSR: using the McOSR library~\cite{lameed2013modular}, \feval\ calls inside loops are instrumented with an OSR point and profiling code to cache the last-known types for the arguments of each \feval\ instruction. When an OSR is fired at run-time, a code generator modifies the original function by inserting a guard to choose between a fast path containing a direct call and a slow path with the original \feval\ call. The second technique is less general and uses value-based JIT compilation: when the first argument of an \feval\ call is an argument of the enclosing function, the compiler replaces each call to this function in all of its callers with a call to a special dispatcher. At run-time, the dispatcher evaluates the value of the argument to use for the \feval\ and executes either a previously compiled cached code or generates and JIT-compiles a version of the function optimized for the current value.  Although the OSR-based approach is more general, it generates much less efficient code compared to the JIT-based one for three reasons:  \begin{enumerate}  \item since the function called through \feval\ is unknown at compile time, the type inference engine is unable to infer types for the returned values, so the compiler has to generate generic instructions (suitable for handling different types) for the remainder of the code;  \item guard computation is expensive, not only because the value of the first argument, but also the types of the remaining arguments have to be checked to choose between the fast and the slow path;  \item since an \feval\ is executed through the interpreter, arguments are boxed to make them more generic before the call.  \end{enumerate}  The first one in particular is a major source of inefficiency for the OSR-based approach, since the benefits from replacing the call to the interpreter's \feval\ dispatcher with a direct call are limited compared to the optimization opportunities deriving from a better type inference on the whole body of the function. In fact, as they operate on boxed values, instructions for generic-type variables are inherently much less efficient than their counterparts for [arrays of] primitive types. While the JIT-based approach is preferable as it generates much better code, on the other hand it cannot be applied to cases in which the first argument $f$ to \feval\ is not passed as argument to the enclosing function $g$. Some possible scenarios for MATLAB programs are:  \begin{itemize}  \item $f$ is an {\tt inline} or an anonymous function defined in $g$;  \item $f$ is the return value from a previous call in $g$ to another function;  %\item $f$ is retrieved from a data structure~\cite{lameed2013feval};  \item $f$ is a constant string containing the name of a user-defined function (a typical misuse of \feval ~\cite{radpour2013refactoring}).  \end{itemize}    Lameed and Hendren conclude their paper by stating, ``It would be interesting to look at future work that combine the  strengths of both approaches". In the remaining part of this section, we extend McVM by implementing a novel optimization mechanism for \feval\ based on our OSR technique: we will show that our mechanism is as efficient as their JIT-based approach in terms of quality of generated code, and is even more general than their OSR-based approach, as it can optimize also \feval\ calls not enclosed in a loop.  \fi  \subsection{Extending McVM}  The McVM virtual machine is a complex research project developed at McGill~\cite{mcvm} and made of several software components, including: a front-end for lowering MATLAB programs to an intermediate representation called IIR that captures the high-level features of the language; an interpreter for running MATLAB functions and scripts in IIR format; a manager component to perform analyses on IIR; a JIT compiler based on LLVM for generating native code for a function, lowering McVM IIR to LLVM IR; a set of helper components to perform fast vector and matrix operations using optimized libraries such as ATLAS, BLAS and LAPACK. %The architecture of McVM is illustrated in Figure [...] 

\noindent In fact, since the type inference engine yields more accurate results for $f^{IIR}_{opt}$ compared to $f^{IIR}$, the IIR compiler can in turn generate efficient specialized IR code for representing and manipulating IIR variables, and compensation code is typically required to unbox or downcast some of the live values passed at the OSR point. Compensation code might also be required to materialize an IR object for an IIR variable that were previously accessed through get/set methods from the environment.  Once a state mapping object has been constructed, the optimizer calls our OSR library to generate the continuation function for the OSR transition and eventually compiles it. A pointer to the compiled function is stored in the code cache and returned to the stub, which invokes it through an indirect call passing the live state saved at the OSR point. %%%%%%%%%%%%%%%%%%%%%%%  \subsection{Comparison to previous approaches}  Lameed and Hendren~\cite{lameed2013feval} proposed two dynamic techniques for optimizing \feval\ instructions in McVM: {\em OSR-based} and {\em JIT-based} specialization, which work as follows.  \paragraph{JIT-based specialization.}   [...]  \paragraph{OSR-based specialization.} When a loop containing an \feval\ becomes hot in a function $f$, an OSR is triggered. At that time, a specialized version $f'$ of $f$ is generated and control is diverted to $f'$. $f'$ is generated by an optimizer that attempts to replace an \feval$(g,x,y,z...)$ instruction with a direct call $g(x,y,z,...)$ with unboxed parameters, leveraging the fact that the actual values of $g$ and its arguments are known at OSR time. The direct call is guarded by a condition that checks if $g$ and the type of its parameters remain the same as observed when the OSR was fired. If the guard fails, the code falls back to executing the original \feval\ instruction.  While this method overcomes the limitations of JIT-based specialization, supporting optimization of \feval\ instructions without , it is substantially slower for two main reasons:  \begin{enumerate}  \item when a function $f$ is first compiled from MATLAB to IR by McVM, the functions it calls via \feval\ are unknown. Hence, the type inference engine is unable to infer types for their returned values, and these values must be kept boxed in heap-allocated objects and handled with slow generic instructions in the IR representation of $f$ (suitable for handling different types). For this reason, the optimized continuation functions $f'$ generated at OSR points in McVM inherit the slow generic instructions of $f$.  \item guard computation in $f'$ can be rather expensive, as it may require checking many parameters;  \end{enumerate}  \noindent Our approach combines the flexibility of OSR-based specialization with the efficiency of JIT-based specialization, answering an open question raised by Lameed and Hendren~\cite{lameed2013feval}. Indeed, [...]  \ifdefined\fullver  The first one is based on OSR: using the McOSR library~\cite{lameed2013modular}, \feval\ calls inside loops are instrumented with an OSR point and profiling code to cache the last-known types for the arguments of each \feval\ instruction. When an OSR is fired at run-time, a code generator modifies the original function by inserting a guard to choose between a fast path containing a direct call and a slow path with the original \feval\ call. The second technique is less general and uses value-based JIT compilation: when the first argument of an \feval\ call is an argument of the enclosing function, the compiler replaces each call to this function in all of its callers with a call to a special dispatcher. At run-time, the dispatcher evaluates the value of the argument to use for the \feval\ and executes either a previously compiled cached code or generates and JIT-compiles a version of the function optimized for the current value.  Although the OSR-based approach is more general, it generates much less efficient code compared to the JIT-based one for three reasons:  \begin{enumerate}  \item since the function called through \feval\ is unknown at compile time, the type inference engine is unable to infer types for the returned values, so the compiler has to generate generic instructions (suitable for handling different types) for the remainder of the code;  \item guard computation is expensive, not only because the value of the first argument, but also the types of the remaining arguments have to be checked to choose between the fast and the slow path;  \item since an \feval\ is executed through the interpreter, arguments are boxed to make them more generic before the call.  \end{enumerate}  The first one in particular is a major source of inefficiency for the OSR-based approach, since the benefits from replacing the call to the interpreter's \feval\ dispatcher with a direct call are limited compared to the optimization opportunities deriving from a better type inference on the whole body of the function. In fact, as they operate on boxed values, instructions for generic-type variables are inherently much less efficient than their counterparts for [arrays of] primitive types. While the JIT-based approach is preferable as it generates much better code, on the other hand it cannot be applied to cases in which the first argument $f$ to \feval\ is not passed as argument to the enclosing function $g$. Some possible scenarios for MATLAB programs are:  \begin{itemize}  \item $f$ is an {\tt inline} or an anonymous function defined in $g$;  \item $f$ is the return value from a previous call in $g$ to another function;  %\item $f$ is retrieved from a data structure~\cite{lameed2013feval};  \item $f$ is a constant string containing the name of a user-defined function (a typical misuse of \feval ~\cite{radpour2013refactoring}).  \end{itemize}    Lameed and Hendren conclude their paper by stating, ``It would be interesting to look at future work that combine the  strengths of both approaches". In the remaining part of this section, we extend McVM by implementing a novel optimization mechanism for \feval\ based on our OSR technique: we will show that our mechanism is as efficient as their JIT-based approach in terms of quality of generated code, and is even more general than their OSR-based approach, as it can optimize also \feval\ calls not enclosed in a loop.  \fi