dcdelia  over 8 years ago

Commit id: f26cb34368362243941e7db1ec82d15b882e9f72


@book{recktenwald2000numerical,
  title={Numerical Methods with {MATLAB}: Implementations and Applications},
  author={Recktenwald, G.W.},
  isbn={9780201308600},
  lccn={00042758},
  series={Featured Titles for Numerical Analysis Series},
  url={http://books.campusvirtualsp.org/books?id=H\_QeAQAAIAAJ},
  year={2000},
  publisher={Prentice Hall}
}

@inproceedings{lattner2004llvm,
  author={Lattner, Chris and Adve, Vikram},
  title={{LLVM}: A Compilation Framework for Lifelong Program Analysis \& Transformation},
  booktitle={Proceedings of the International Symposium on Code Generation and Optimization (CGO '04)},
  year={2004}
}

%TinyVM is thus an ideal playground to exercise our OSR technique, and we use it to run performance measurements on the shootout test suite, also known as the Computer Language Benchmarks Game~\cite{shootout}.
%At the end of the section, we show experimental results for the case study presented in \mysection\ref{se:case-study}.
\begin{table}[hb]
\begin{center}
\begin{small}
\begin{tabular}{ |c|c| }

Starting from this version, we also generate an {\em optimized} version using the LLVM optimizer {\tt opt -O1}.
\fi
For question Q4, we analyze the impact of the optimization technique presented in \mysection\ref{sse:optimizing_feval} on the running time of a few numeric benchmarks, namely {\tt odeEuler}, {\tt odeMidpt}, {\tt odeRK4}, and {\tt sim\_anl}. The first three benchmarks~\cite{recktenwald2000numerical} solve an ODE for heat treating simulation using the Euler, midpoint, and Runge-Kutta methods, respectively; the last benchmark minimizes the six-hump camelback function with the method of simulated annealing\footnote{\url{http://www.mathworks.com/matlabcentral/fileexchange/33109-simulated-annealing-optimization}}.
%\footnote{\url{http://web.cecs.pdx.edu/~gerry/nmm/mfiles/}}
All the experiments were performed on an octa-core 2.3 GHz Intel Xeon E5-4610 v2 with 256+256 KB of L1 cache, 2 MB of L2 cache, 16 MB of shared L3 cache, and 128 GB of DDR3 main memory, running Debian Wheezy 7, Linux kernel 3.2.0,
\ifdefined \fullver

%\item {\bf Message 3}: what is the overhead of the library for inserting OSR points? We compute for each benchmark the time required by insertOpenOSR (OSR point insertion + stub creation) and insertFinalizedOSR (OSR point insertion + generation of continuation function).
%\end{itemize}
\paragraph{[Q1] Impact on Code Quality.}
In order to measure how much a never-firing OSR point might impact code quality, we analyzed the source-code structure of each benchmark and profiled its run-time behavior to identify performance-critical sections for OSR point insertion. The distinction between open and resolved OSR points is nearly irrelevant in this context: we chose to focus on open OSR points, passing {\tt null} as the {\tt val} argument for the stub.
% figure
\ifdefined\noauthorea
\begin{figure}[hb]
\begin{center}
\includegraphics[width=0.95\columnwidth]{figures/code-quality-noBB/code-quality-noBB.eps}
\caption{\protect\input{figures/code-quality-noBB/caption}}

% figure
\ifdefined\noauthorea
\begin{figure}[b]
\begin{center}
%\vspace{-0.55cm}
\includegraphics[width=0.95\columnwidth]{figures/code-quality-O1-noBB/code-quality-O1-noBB.eps}

\fi
\paragraph{Impact on Code Quality.}
In order to measure how much a never-firing OSR point might impact code quality, we analyzed the source-code structure of each benchmark and profiled its run-time behavior to identify performance-critical sections for OSR point insertion. The distinction between open and resolved OSR points is nearly irrelevant in this context: we chose to focus on open OSR points, as their calls take an extra argument for profiling that we set to {\tt null}.
For iterative benchmarks, we insert an OSR point in the body of their hottest loops. We classify a loop as hottest when its body is executed for a very high cumulative number of iterations (e.g., from a few thousand up to billions) and it either calls the method with the highest {\em self} time in the program or performs the most computationally intensive operations of the program in its own body.
\ifdefined \fullver

%We analyzed the code produced by the x86-64 back-end: the OSR machinery is lowered into three native instructions that load a counter in a register, compare it against a constant value and jump to the OSR block accordingly.
The number of times the OSR condition is checked for each benchmark is the same as in the experiments reported in \mytable\ref{tab:sameFun}.
\paragraph{[Q2] Overhead of OSR Transitions.}
\mytable\ref{tab:sameFun} reports estimates of the average cost of performing an OSR transition to a clone of the running function. For each benchmark we compute the time difference between the scenarios in which an always-firing and a never-firing resolved OSR point is inserted in the code, respectively; we then normalize this difference against the number of fired OSR transitions.
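In symbols, writing $T_{\mathit{always}}$ and $T_{\mathit{never}}$ for the total running times measured with an always-firing and a never-firing resolved OSR point, respectively, and $N_{\mathit{fired}}$ for the number of OSR transitions actually fired, the per-transition estimate just described is (the symbol names are ours and are introduced here only for clarity):
\[
\overline{c}_{\mathit{OSR}} \;\approx\; \frac{T_{\mathit{always}} - T_{\mathit{never}}}{N_{\mathit{fired}}}.
\]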

Normalized differences reported in the table represent a reasonable estimate of the average cost of firing a single OSR transition, which in other words is the cost of performing a function call passing the live variables as arguments. Reported numbers are in the order of nanoseconds, and might be negative due to instruction cache effects.
\paragraph{[Q3] OSR Machinery Generation.}
We now discuss the overhead of the \osrkit\ library for inserting OSR machinery in the IR of a function. \mytable\ref{tab:instrTime} reports for each benchmark the number of IR instructions in the instrumented function, the number of live values to transfer, and the time spent in the IR manipulation. Locations for OSR points are chosen as in the experiments on code quality, and the target function is a clone of the source function.
For open OSR points, we report the time spent in inserting the OSR point in the function and in generating the stub; neither operation depends on the size of the function. For resolved OSR points, we report the time spent in inserting the OSR point and in generating the \fosrto\ function.
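To give a concrete, if simplified, picture of this machinery, the following C++ sketch mimics a resolved OSR point guarding a hot loop. It is purely illustrative: all identifiers ({\tt counter}, {\tt THRESHOLD}, {\tt f\_osr\_continuation}) are hypothetical, and \osrkit\ emits the equivalent logic directly at the LLVM IR level rather than as source code.
\begin{verbatim}
#include <cstdint>
#include <cstdio>

// Stand-in for the continuation function generated for a resolved
// OSR point: it receives the live variables (i, acc, n) and resumes
// the computation in the target version of the function (here simply
// a clone of the same loop).
static int64_t f_osr_continuation(int64_t i, int64_t acc, int64_t n) {
    for (; i < n; ++i) acc += i * i;
    return acc;
}

// Hot loop instrumented with a (hypothetical) resolved OSR point.
static int64_t hot_loop(int64_t n) {
    int64_t acc = 0;
    int64_t counter = 0;               // profiling counter
    const int64_t THRESHOLD = 100000;  // firing threshold
    for (int64_t i = 0; i < n; ++i) {
        if (++counter >= THRESHOLD)    // OSR condition check
            return f_osr_continuation(i, acc, n);  // fire OSR: pass live vars
        acc += i * i;                  // original loop body
    }
    return acc;
}

int main() {
    std::printf("%lld\n", (long long) hot_loop(2000000));
    return 0;
}
\end{verbatim}
The guard boils down to loading a counter, comparing it against a constant, and branching, while firing the transition is exactly the call that passes the live variables as arguments, whose cost is measured above.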

Experimental results presented in this section suggest that inserting an OSR point is a quick operation and is unlikely to degrade the quality of generated code. The time required to fire an OSR transition is negligible (i.e., on the order of nanoseconds), while the cost of generating a continuation function, either when inserting a resolved OSR point or from the callback method invoked at an open OSR transition, is likely to be dominated by the cost of its compilation. For a front-end, the choice of whether to insert an OSR point into a function for dynamic optimization merely depends on the trade-off between the expected benefits in execution time and the overheads of generating and JIT-compiling an optimized version of the function; compared to these two operations, the cost of OSR-related operations is negligible.
\paragraph{[Q4] Optimizing {\tt feval} in MATLAB.}
%\label{sse:feval-results}
%We conclude this section with a discussion on the effectiveness of our optimization technique for McVM. In particular, we analyze its impact on the running time of a few numeric benchmarks, namely {\tt odeEuler}, {\tt odeMidpt}, {\tt odeRK4}, and {\tt sim\_anl}. The first three benchmarks solve an ODE for heat treating simulation using the Euler, midpoint, and Runge-Kutta methods, respectively\footnote{\url{http://web.cecs.pdx.edu/~gerry/nmm/mfiles/}}; the last benchmark minimizes the six-hump camelback function with the method of simulated annealing\footnote{\url{http://www.mathworks.com/matlabcentral/fileexchange/33109-simulated-annealing-optimization}}.
We report the speed-ups enabled by our technique in \mytable\ref{tab:feval}, using the running times for McVM's \feval\ default dispatcher as baseline. As the dispatcher typically JIT-compiles the invoked function, we also analyzed running times when the dispatcher calls a previously compiled function. In the last column, we show speed-ups from a modified version of the benchmarks in which each \feval\ call is replaced by hand with a direct call to the function used in the specific benchmark.
\begin{table}[hb]
\begin{small}
% dirty hack for text wrapping

\end{tabular}
\caption{\label{tab:feval} Speedup comparison for \feval\ optimization.}
\end{small}
\end{table}
Unfortunately, we are unable to compute direct performance metrics for the solution by Lameed and Hendren, since its source code has not been released. Numbers in their paper~\cite{lameed2013feval} show that for these benchmarks the speed-up of the OSR-based approach is on average $30.1\%$ of the speed-up from hand-coded calls, ranging from $9.2\%$ to $73.9\%$; for the JIT-based approach the average percentage grows to $84.7\%$, ranging from $75.7\%$ to $96.5\%$.
Our optimization technique yields speed-ups that are very close to the upper bound given by by-hand optimization: in the worst case ({\tt odeRK4}), we observe a $94.1\%$ ratio when the optimized code is generated on the fly, which becomes $97.5\%$ when a cached version is available. Compared to their OSR-based approach, the compensation entry block is a key driver of improved performance, as the benefits from a better type-specialized whole function body outweigh those from performing a direct call using boxed arguments and return values in place of the original \feval.
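To illustrate why specializing the whole function body can outperform a direct call on boxed values, the following sketch (plain C++, not actual McVM code; all types and names are made up) contrasts the two strategies in miniature.
\begin{verbatim}
#include <cstdio>
#include <vector>

struct Box { double value; };   // stand-in for a boxed MATLAB value
using BoxedFn = std::vector<Box> (*)(const std::vector<Box>&);

// Callee in boxed form, as a generic dispatcher would invoke it.
static std::vector<Box> square_boxed(const std::vector<Box>& args) {
    return { Box{ args[0].value * args[0].value } };
}

// (a) Direct call with boxed arguments and return values: the feval
//     dispatch overhead is gone, but the caller still cannot be
//     type-specialized around the call.
static double sum_boxed(BoxedFn f, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += f({ Box{ double(i) } })[0].value;  // box/unbox each iteration
    return acc;
}

// (b) Whole-body specialization: the caller is recompiled assuming an
//     unboxed signature for the feval target, so the loop stays in
//     plain doubles; a guard plus compensation entry block (not shown)
//     would divert to the generic version if the target changes.
static double square(double x) { return x * x; }

static double sum_specialized(double (*f)(double), int n) {
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc += f(double(i));                      // unboxed fast path
    return acc;
}

int main() {
    std::printf("%f %f\n", sum_boxed(square_boxed, 1000),
                sum_specialized(square, 1000));
    return 0;
}
\end{verbatim}
In (a) every iteration still pays for boxing and unboxing across an opaque call, whereas in (b) the surrounding loop is compiled against the concrete signature of the \feval\ target; as discussed above, the compensation entry block is what allows this form of whole-body specialization to be applied to the running function.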