Authorea

Daniele Cono D'Elia edited case-study.tex over 8 years ago

Commit id: 129712ecaa27d960357d2ed96ddb19eb7739f098

deletions | additions

\section{Case Study} \label{case-study} MATLAB is a popular dynamic language for scientific and numerical programming. Introduced in the late 1970s mainly as a scripting language for performing computations through efficient libraries, it has evolved over the years into a more complex programming language with support for high-level features such as functions, packages and object orientation. A popular feature of the language is the \feval\ construct, a built-in higher-order function that enable the invocation of the function passed as first argument on the set of subsequent arguments of the {\tt feval} \feval\ call, and to return the computed result. This feature is heavily used in many classes of numerical computations, such as iterative methods for the approximation of the solutions solution of an ordinary differential equations equation (ODE) and simulated annealing heuristics to locate a good approximation to the global optimum of a function in a large search space. A previous study by Lameed and Hendren~\cite{lameed2013feval} shows that the overhead of an {\tt feval} \feval\ call is significantly high compared to a direct call, especially in JIT-based execution environments such as McVM and the proprietary MATLAB JIT accelerator by Mathworks. The main reason for this overhead is that the presence of an {\tt feval} \feval\ instruction can disrupt the results of intra- and inter-procedural level for type and array shape inference analyses, which are a key ingredient for efficient code generation. Lameed and Hendren thus propose and implement in McVM two dynamic techniques for optimizing {\tt feval} \feval\ instructions. The first technique is based on OSR: using the McOSR library~\cite{lameed2013modular}, {\tt feval} \feval\ calls inside loops are instrumented with an OSR point and with profiling code to cache the last-known types for the arguments of each {\tt feval} \feval\ instruction. When an OSR is fired at run-time, a code generator modifies the original function by inserting a guard to choose between a fast path containing a direct call and a slow path with the original {\tt feval} \feval\ call. The second technique is less general and uses value-based JIT compilation: when the first argument of an {\tt feval} \feval\ call is an argument of the enclosing function, the compiler replaces each call to this function in all of its callers with a call to a special dispatcher. When the program is executed, the dispatcher will evaluate the value of the parameter for the {\tt feval} \feval\ and execute either a previously compiled cached code or generate and JIT-compile a method optimized for the current value of the argument. Although the OSR-based approach is more general, it generates much less efficient code compared to the JIT-based version for three reasons: \begin{enumerate} \item since the function called through {\tt feval} \feval\ is unknown at compile time, the type inference engine is unable to infer types for the returned values, so the compiler has to generate generic instructions (suitable for handling different types) for the remainder of the code; \item guard computation is expensive, because not only the value of the first argument, but also the types of the remaining arguments have to be checked to choose between the fast and the slow path; \item since an {\tt feval} \feval\ is executed through the interpreter, in the original functions arguments are boxed to make them more generic before the call. \end{enumerate} The first one in particular is a major source of inefficiency for the OSR-based approach, since the benefits from replacing the call to the interpreter's {\tt feval} \feval\ dispatcher with a direct call are limited compared to the optimization opportunities deriving from a better type inference on the whole body of the function. In fact, as they operate on boxed values, instructions operating on generic-type variables are inherently much less efficient than their counterparts for [arrays of] primitive types. While the JIT-based approach is preferable as it generates much better code, on the other hand it cannot be applied to cases in which the first argument $f$ to {\tt feval} \feval\ is not passed as argument to the enclosing function $g$. Some possible scenarios are: \begin{itemize} \item $f$ is an {\tt inline} or an anonymous function defined in $g$; \item $f$ is the return value from a previous call in $g$ to another function; %\item $f$ is retrieved from a data structure~\cite{lameed2013feval}; \item $f$ is a constant string containing the name of a user-defined function (a typical inappropriate use of {\tt feval} \feval\ according to~\cite{radpour2013refactoring}). \end{itemize} Lameed and Hendren conclude their paper by stating, ``It would be interesting to look at future work that combine the strengths of both approaches". In the remaining part of this section, we extend McVM by implementing a novel optimization mechanism for {\tt feval} \feval\ based on our OSR technique: we will show that our mechanism is as efficient as their JIT-based approach in terms of quality of generated code, and is even more general than their OSR-based approach, as it can optimize also {\tt feval} \feval\ calls not enclosed in a loop. \subsection{Extending McVM} The McVM virtual machine is a complex research project developed at McGill and composed of several software components, including: a front-end for lowering MATLAB programs to an intermediate representation called IIR that captures all of the high-level features of the language; an interpreter for running MATLAB functions and scripts in IIR format; a manager component to perform analyses on IIR; a JIT compiler based on LLVM for generating native code for a function, thus lowering McVM IIR to LLVM IR; a set of helper components to perform fast vector and matrix operations using optimized libraries such as ATLAS, BLAS and LAPACK. The architecture of McVM is illustrated in Figure [...] McVM implements a function versioning mechanism based on type specialization: for each IIR representation of a function, different IR versions are generated according to the types of the arguments at each call site. The number of generated versions per function is on average small (i.e., less than two), as in most cases functions are always called with the same argument types. Type specialization is the main factor for the generation of efficient code in McVM~\cite{chevalier2010mcvm}. The source code of McVM is publicly available~\cite{mcvm}, and after porting it from the legacy LLVM JIT to MCJIT, we have extended it with the following components to enable the optimization of {\tt feval} \feval\ instructions: \begin{enumerate} \item An analysis pass for {\tt feval} \feval\ instructions in the IIR representation of a function \item An extension for the IIR compiler to track the correspondence between IIR and IR objects at {\tt feval} \feval\ calls \item A helper component to insert OSR points in the IR at IIR locations annotated during the analysis pass \item A callback optimizer triggered at OSR points, which in turn is made of: \begin{enumerate} \item A profile-driven IIR generator to replace {\tt feval} \feval\ calls with direct calls \item A helper component to lower the optimized IIR function to IR and construct a state mapping \item A code caching mechanism to handle the compilation of the continuation functions \end{enumerate} \end{enumerate} The analysis pass is integrated in McVM's analysis manager and identifies optimization opportunities for functions containing {\tt feval} \feval\ instructions. A function becomes a candidate for optimization when at least one of its {\tt feval} \feval\ calls is inside a loop. We then group {\tt feval} \feval\ instructions whose first argument is reached by the same definition, using reaching definition information already computed by the analysis manager for previous optimizations. For each group we mark for instrumentation only those instructions that dominate the others, so that the function can be optimized as early as possible at run-time. The analysis pass is also able to determine whether the value of the argument can change across two executions of the same {\tt feval} \feval\ instruction, thus discriminating when a run-time guard must be inserted during the run-time optimization phase. Compared to the OSR-based approach by Lameed and Hendren, our solution is cheaper because the types for the other arguments do not need to be cached or guarded: as we will see later, the type inference engine will compute the most accurate yet sound type information in the analysis of the optimized IIR where direct calls are used. When the IIR compiler processes an annotated {\tt feval} \feval\ instruction, it will store in the metadata of the function version being compiled a copy of its variable map (i.e., a map between IIR and IR objects), the current {\tt llvm::BasicBlock*} created for the call and the {\tt llvm::Value*} object corresponding to the first argument for the {\tt feval}. \feval. The last two objects are used by the helper component as source label and profiling value for inserting an open OSR point, with the copy of the variable map being passed (along with other information) as {\tt extra} field. The open-OSR stub will in turn invoke the callback optimizer component we are about to describe in the next subsection. \subsection{Generating optimized code} The core of the optimization pipeline is the callback optimizer component, that is responsible for generating optimized code for the current function $f$ using profiling (i.e., the object containing the first argument for \feval) and contextual information passed from the open-OSR stub. As a first step, the optimizer will process the profiling object to resolve the target of the call - which we call $g$ - and check whether a previously compiled optimized function is available from the code cache. If not, a new function $f_{opt}$ is generated by cloning the IIR representation $f^{IIR}$ of $f$ into

This map is essential for the construction of a state mapping between $f^{IR}$ to $f^{IR}_{opt}$, as it is compared against the corresponding map stored during the lowering of $f$ to determine for each value in $f^{IR}_{opt}$ live at the continuation block whether: \begin{itemize} \item an {\tt llvm::Value*} from $f^{IR}$ passed as argument at the OSR point can be used directly \item or, compensation code is required for reconstructing to reconstruct its value. value before jumping to the block. \end{itemize} In fact, since the type inference engine yields more accurate results for $f^{IIR}_{opt}$ compared to $f^{IIR}$, the IIR compiler can in turn generate efficient specialized IR code for representing and manipulating IIR variables, and compensation code is typically required to perform unboxing or downcasting on the live values passed at the OSR point. Compensation code might also be required to materialize an IR object for an IIR variable previously accessed through get/set methods from the environment. Once a state mapping object has been constructed, the optimizer calls our OSR library to generate the continuation function for the OSR transition and eventually compiles it through MCJIT. A pointer to the compiled function is stored in the code cache and returned to the stub, which invokes it through an indirect call passing the live state saved at the OSR point. \subsection{Experimental results} We have evaluated the effectiveness of our technique on four benchmarks, namely {\tt odeEuler}, {\tt odeMidpt}, {\tt odeRK4}, and {\tt sim_anl}. The first three benchmarks solve an ODE for heat treating simulation using the Euler, midpoint, and Range-Kutta method, respectively; the last benchmark minimizes the six-hump camelback function with the method of simulated annealing.