\noindent In fact, since the type inference engine yields more accurate results for $f^{IIR}_{opt}$ than for $f^{IIR}$, the IIR compiler can in turn generate efficient, specialized IR code for representing and manipulating IIR variables; hence, compensation code is typically required to unbox or downcast some of the live values passed at the OSR point. Compensation code might also be required to materialize an IR object for an IIR variable that was previously accessed through get/set methods from the environment. Once a state mapping object has been constructed, the optimizer calls our OSR library to generate the continuation function for the OSR transition and eventually compiles it. A pointer to the compiled function is stored in the code cache and returned to the stub, which invokes it through an indirect call, passing the live state saved at the OSR point.

\subsection{Experimental Results}

We evaluated the effectiveness of our technique on four benchmarks, namely {\tt odeEuler}, {\tt odeMidpt}, {\tt odeRK4}, and {\tt sim\_anl}. The first three benchmarks solve an ODE for heat treating simulation using the Euler, midpoint, and Runge-Kutta method, respectively\footnote{\url{http://web.cecs.pdx.edu/~gerry/nmm/mfiles/}}; the last benchmark minimizes the six-hump camelback function with the simulated annealing method\footnote{\url{http://www.mathworks.com/matlabcentral/fileexchange/33109-simulated-annealing-optimization}}. We report the speed-ups enabled by our technique in \mytable\ref{tab:feval}, using the running times for McVM's default \feval\ dispatcher as baseline. As the dispatcher typically JIT-compiles the invoked function, we also analyzed running times when the dispatcher calls a previously compiled function. In the last column, we show the speed-ups from a modified version of the benchmarks in which each \feval\ call is replaced by hand with a direct call to the function used in the specific benchmark.

\begin{table}
\begin{small}
\begin{tabular}{|c|c|c|c|c|}
\cline{2-5}
\multicolumn{1}{c|}{} & Base & Optimized & Optimized & Direct \\
\cline{1-1}
{\em Benchmark} & (cached) & (JIT) & (cached) & (by hand) \\
\hline
\hline
odeEuler & 1.046 & 2.796 & 2.800 & 2.828 \\
\hline
odeMidpt & 1.014 & 2.645 & 2.660 & 2.685 \\
\hline
odeRK4 & 1.005 & 2.490 & 2.582 & 2.647 \\
\hline
sim\_anl & 1.009 & 1.564 & 1.606 & 1.612 \\
\hline
\end{tabular}
\caption{\label{tab:feval} Speedup comparison for \feval\ optimization.}
\end{small}
\end{table}

Unfortunately, we are unable to compute direct performance metrics for the solution by Lameed and Hendren, since its source code has not been released. The numbers reported in their paper~\cite{lameed2013feval} show that on these benchmarks the speed-up of their OSR-based approach is on average $30.1\%$ of the speed-up from hand-coded calls, ranging from $9.2\%$ to $73.9\%$; for their JIT-based approach, the average grows to $84.7\%$, ranging from $75.7\%$ to $96.5\%$. Our optimization technique yields speed-ups that are very close to the upper bound set by by-hand optimization: in the worst case ({\tt odeRK4}), we observe $94.1\%$ of the by-hand speed-up when the optimized code is generated on the fly, which grows to $97.5\%$ when a cached version is available.
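For concreteness, these percentages are simply the ratio between the speed-up of our technique and the by-hand upper bound reported in \mytable\ref{tab:feval}; for {\tt odeRK4}:
\[
\frac{2.490}{2.647} \approx 0.941
\qquad\qquad
\frac{2.582}{2.647} \approx 0.975
\]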
Compared to their OSR-based approach, the compensation entry block is a key driver of the improved performance: the benefits from executing a whole function body specialized with more accurate type information outweigh those from merely performing a direct call, with boxed arguments and return values, in place of the original \feval.
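To make the mechanism described above concrete, the sketch below mimics in plain C++ the dispatch path that McVM realizes at the LLVM IR level: the stub consults the code cache, triggers compilation on the first transition, performs compensation by unboxing a live value, and resumes execution in the specialized continuation through an indirect call. All names and the loop body are hypothetical stand-ins, not McVM's actual API.

\begin{verbatim}
// Plain-C++ analogue of the OSR transition; all names are
// illustrative stand-ins (McVM emits this logic as LLVM IR).
#include <cstdio>
#include <unordered_map>

// Boxed representation used by the unoptimized f^IIR.
struct BoxedValue { double payload; };

// Live state saved at the OSR point of f^IIR.
struct LiveState {
  BoxedValue *acc;  // boxed accumulator: needs unboxing
  long i;           // loop counter: passed through unchanged
};

// Signature of the compiled continuation for f^IIR_opt.
typedef double (*ContinuationFn)(double acc, long i);

// Stand-in continuation: in McVM its entry block holds the
// compensation code, followed by the type-specialized
// remainder of the interrupted loop.
static double continuation(double acc, long i) {
  for (; i < 1000000; ++i)
    acc += 0.5;  // specialized on unboxed doubles
  return acc;
}

// Code cache: OSR point descriptor -> compiled continuation.
static std::unordered_map<void *, ContinuationFn> codeCache;

// Stub hit at the OSR point: obtains the continuation
// (compiling it on the first transition), applies the
// compensation, and transfers control via an indirect call.
static double osrStub(void *osrPoint, LiveState *state) {
  ContinuationFn fn = codeCache[osrPoint];
  if (!fn) {
    fn = continuation;      // stands in for IIR-to-IR compilation
    codeCache[osrPoint] = fn;
  }
  double unboxed = state->acc->payload;  // compensation: unbox
  return fn(unboxed, state->i);          // indirect call
}

int main() {
  BoxedValue acc = {0.0};
  LiveState state = {&acc, 0};
  static char osrPointId;  // unique address naming the OSR point
  std::printf("%f\n", osrStub(&osrPointId, &state));
  return 0;
}
\end{verbatim}

\noindent In the real system the compensation step may also materialize IR objects for IIR variables previously accessed through get/set methods of the environment, as discussed above.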