dcdelia  over 8 years ago

Commit id: dd5b9fdab78fd4f3922c95ac1c65b187e35c6ce8

Results for the unoptimized and optimized versions of the benchmarks are reported in \myfigure\ref{fig:code-quality-base} and \myfigure\ref{fig:code-quality-O1}, respectively. For both scenarios we observe that the overhead is very small, i.e., less than $1\%$ for most benchmarks and less than $2\%$ in the worst case. For some benchmarks, code might even run slightly faster after OSR point insertion due to instruction cache effects.
%We analyzed the code produced by the x86-64 back-end: the OSR machinery is lowered into three native instructions that load a counter in a register, compare it against a constant value and jump to the OSR block accordingly.
The number of times the OSR condition is checked for each benchmark is the same as in the experiments reported in \mytable\ref{tab:sameFun}.

\begin{table*}
\begin{center}
\begin{small}
\begin{tabular}{ |c|c|c|c|c|c|c|c| }

\end{table*}

\ifauthorea{\newline}{}

\paragraph{Overhead of OSR Transitions.}
\mytable\ref{tab:sameFun} reports estimates of the average cost of performing an OSR transition to a clone of the running function. For each benchmark we compute the difference between the running times of the scenarios in which an always-firing and a never-firing resolved OSR point is inserted in the code, respectively; we then normalize this difference against the number of fired OSR transitions.
Hot code portions for OSR point insertion were identified as in the code quality experiments. However, since for hot loops we want to perform an OSR transition at each iteration, inserting an always-firing OSR point in the enclosing function is not an option: the function we OSR into would then fire an OSR itself, eventually leading to a very large number of active stack frames. Depending on the characteristics of the hot loop, we either transform it into a separate function and instrument its entrypoint, or, when the loop calls a method with a high self time, we insert an OSR point at the beginning of that method.
The normalized differences reported in the table provide a reasonable estimate of the average cost of firing a single OSR transition, which is essentially the cost of performing a function call that passes the live variables as arguments. The reported figures are on the order of nanoseconds, and may be negative due to instruction cache effects.

\paragraph{OSR Machinery Generation.}
We now discuss the overhead of the \osrkit\ library for inserting OSR machinery in the IR of a function. For each benchmark, \mytable\ref{tab:instrTime} reports the number of IR instructions in the instrumented function, the number of live values to transfer, and the time spent in the IR manipulation. Locations for OSR points are chosen as in the code quality experiments, and the target function is a clone of the source function.
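In symbols, the per-transition estimate reported above in \mytable\ref{tab:sameFun} can be written as (the notation below is ours):
\begin{equation*}
c_{\mathrm{OSR}} \;\approx\; \frac{T_{\mathrm{always}} - T_{\mathrm{never}}}{N_{\mathrm{fired}}}
\end{equation*}
where $T_{\mathrm{always}}$ and $T_{\mathrm{never}}$ are the running times measured with an always-firing and a never-firing resolved OSR point, respectively, and $N_{\mathrm{fired}}$ is the number of fired OSR transitions.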

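To make the instrumentation concrete, the following C sketch (illustrative only: identifiers such as \texttt{f\_osr\_cont} and the threshold constant are ours, and the real machinery is generated at the LLVM IR level rather than written by hand) pictures a resolved OSR point guarding a hot loop; the check reduces to loading a counter, comparing it against a constant, and branching, while firing the transition amounts to a call that passes the live values as arguments.

\begin{small}
\begin{verbatim}
#include <stdio.h>

#define OSR_THRESHOLD 1000        /* hypothetical firing threshold */

/* Continuation function: execution resumes from the loop state
   (n, acc, i), which is received through the call arguments. */
static long f_osr_cont(long n, long acc, long i) {
    for (; i < n; ++i) acc += i;
    return acc;
}

/* Hot function with a resolved OSR point at the loop header:
   update a counter, compare it against a constant, and branch
   to the OSR block when the condition holds. */
static long f(long n) {
    static long osr_counter = 0;
    long acc = 0;
    for (long i = 0; i < n; ++i) {
        if (++osr_counter >= OSR_THRESHOLD)
            return f_osr_cont(n, acc, i);  /* live values as arguments */
        acc += i;
    }
    return acc;
}

int main(void) {
    printf("%ld\n", f(100000));
    return 0;
}
\end{verbatim}
\end{small}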
Not surprisingly, constructing a continuation function takes longer than the other operations (up to 1 ms, versus 20--40 $\mu$s), as it involves cloning and manipulating the body of the target function and thus depends on its size: \mytable\ref{tab:instrTime} therefore includes an additional column in which time is normalized against the number of IR instructions in the target.

\begin{table}[hb]
\begin{small}
\begin{tabular}{ |c|c|c|c|c|c|c| }
\cline{3-7}
\multicolumn{2}{l|}{} & \multicolumn{2}{c|}{{\em Open OSR}} & \multicolumn{3}{c|}{{\em Resolved OSR}} \\
\cline{3-7}
\multicolumn{2}{l|}{} & Insert & Gen. & Insert & \multicolumn{2}{|c|}{Generate \fosrto} \\
\cline{1-2} \cline{6-7}
{\em Benchmark} & \textbar IR\textbar & point & stub & point & Total & Avg/inst \\
\hline

\label{sse:feval-results}
We conclude this section with a discussion of the effectiveness of our optimization technique for McVM. In particular, we analyze its impact on the running time of a few numeric benchmarks, namely {\tt odeEuler}, {\tt odeMidpt}, {\tt odeRK4}, and {\tt sim\_anl}. The first three benchmarks solve an ODE for heat treating simulation using the Euler, midpoint, and Runge-Kutta methods, respectively\footnote{\url{http://web.cecs.pdx.edu/~gerry/nmm/mfiles/}}; the last benchmark minimizes the six-hump camelback function with the method of simulated annealing\footnote{\url{http://www.mathworks.com/matlabcentral/fileexchange/33109-simulated-annealing-optimization}}. We report the speed-ups enabled by our technique in \mytable\ref{tab:feval}, using the running times for McVM's default \feval\ dispatcher as the baseline. As the dispatcher typically JIT-compiles the invoked function, we also analyzed running times when the dispatcher calls a previously compiled function. In the last column, we show speed-ups from a modified version of the benchmarks in which each \feval\ call is replaced by hand with a direct call to the function used in the specific benchmark.

\begin{table}[hb]
\begin{small}
% dirty hack for text wrapping
\begin{tabular}{ |c|c|c|c|c| }