Aritra Chakrabarty added file MTech thesis/chap1.tex almost 8 years ago

Commit id: c0f4ba320a34da92625ca7ba3315553c38296663

%% This is an example first chapter.  You should put chapter/appendix that you
%% write into a separate file, and add a line \include{yourfilename} to
%% main.tex, where `yourfilename.tex' is the name of the chapter/appendix file.
%% You can process specific files by typing their names in at the
%% \files=
%% prompt when you run the file main.tex through LaTeX.

\chapter{Introduction}

Micro-optimization is a technique for reducing the overall operation
count of floating point computations.  In a standard floating point
unit, floating point operations are fairly high level, such as
``multiply'' and ``add''; in a micro floating point unit ($\mu$FPU),
these have been broken down into their constituent low-level floating
point operations on the mantissas and exponents of the floating point
numbers.

Chapter two describes the architecture of the $\mu$FPU and the
motivations for the design decisions made.

Chapter three describes the design of the compiler, as well as how the
optimizations discussed in section~\ref{ch1:opts} were implemented.

Chapter four describes the purpose of the test code that was compiled,
and which statistics were gathered by running it through the simulator.
The purpose is to measure what effect the micro-optimizations had,
compared to unoptimized code.  Possible future expansions to the
project are also discussed.

\section{Motivations for micro-optimization}

The idea of micro-optimization is motivated by the recent trends in
computer architecture towards low-level parallelism and small,
pipelineable instruction sets \cite{patterson:risc,rad83}.  By getting
rid of the more complex instructions and concentrating on optimizing
the frequently used ones, these architectures realized substantial
increases in performance.

Another important motivation was the trend towards placing more of the
burden of performance on the compiler.  Many of the new architectures
depend on an intelligent, optimizing compiler in order to realize
anywhere near their peak performance
\cite{ellis:bulldog,pet87,coutant:precision-compilers}.  In these
cases, the compiler is not only responsible for faithfully generating
native code to match the source language, but must also be aware of
instruction latencies, delayed branches, pipeline stages, and a
multitude of other factors in order to generate fast code \cite{gib86}.

Taking these ideas one step further, the floating point operations that
are normally single, large instructions can be further broken down into
smaller, simpler, faster instructions, with more control in the
compiler and less in the hardware.  This is the idea behind a
micro-optimizing FPU: break the floating point instructions down into
their basic components and use a small, fast implementation, with a
large part of the burden of hardware allocation and optimization
shifted towards compile time.

Along with the hardware speedups possible by using a $\mu$FPU, there
are also optimizations that the compiler can perform on the generated
code.  In a normal sequence of floating point operations, there are
many hidden redundancies that can be eliminated by allowing the
compiler to control the floating point operations down to their lowest
level.  These optimizations are described in detail in
section~\ref{ch1:opts}.
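To make this decomposition concrete, the following sketch (an
illustration written for this discussion, not code from the $\mu$FPU
itself) shows a floating point multiply broken into the three
micro-operations a $\mu$FPU would expose to the compiler.  It assumes a
simplified unsigned format with a 32-bit mantissa and a separate
integer exponent; sign handling, rounding, and guard bits are omitted.

\begin{verbatim}
#include <stdint.h>

/* Simplified unsigned floating point value: mant/2^32 * 2^exp,
   with a normalized mantissa keeping its top bit set. */
typedef struct { uint32_t mant; int32_t exp; } ufloat;

ufloat fmul(ufloat a, ufloat b)
{
    ufloat r;
    /* micro-op 1: multiply mantissas, keeping the high 32 bits */
    r.mant = (uint32_t)(((uint64_t)a.mant * b.mant) >> 32);
    /* micro-op 2: add exponents */
    r.exp = a.exp + b.exp;
    /* micro-op 3: normalize -- the product of two normalized
       mantissas can be denormalized by at most one bit */
    if (!(r.mant & 0x80000000u)) { r.mant <<= 1; r.exp -= 1; }
    return r;
}
\end{verbatim}

On a $\mu$FPU, each of these micro-operations would be a separate,
individually schedulable instruction, which is what gives the compiler
room to optimize across them.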
\section{Description of micro-optimization}\label{ch1:opts}

In the course of performing a sequence of floating point operations, a
normal FPU carries out many redundant internal shifts and
normalizations.  However, if a compiler can decompose the floating
point operations it needs down to the lowest level, it can then
optimize away many of these redundant operations.

If there is some additional hardware support specifically for
micro-optimization, further optimizations can be performed.  This
hardware support entails extra ``guard bits'' on the standard floating
point formats, to allow several unnormalized operations to be performed
in a row without loss of information\footnote{A description of the
floating point format used is shown in figures~\ref{exponent-format}
and~\ref{mantissa-format}.}.  A discussion of the mathematics behind
unnormalized arithmetic is in appendix~\ref{unnorm-math}.

The optimizations that the compiler can perform fall into several
categories:

\subsection{Post Multiply Normalization}

When two or more multiplications are performed in a row, the
intermediate normalization of the results between multiplications can
be eliminated.  This is because with each multiplication, the mantissa
can become denormalized by at most one bit.  If there are guard bits on
the mantissas to prevent bits from ``falling off'' the end during
multiplications, the normalization can be postponed until after a
sequence of several multiplies\footnote{Using unnormalized numbers for
math is not a new idea; a good example of it is the Control Data CDC
6600, designed by Seymour Cray \cite{thornton:cdc6600}.  The CDC 6600
performed all of its arithmetic unnormalized, with a separate {\tt
NORMALIZE} instruction.}.

% This is an example of how you would use tgrind to include an example
% of source code; it is commented out in this template since the code
% example file does not exist.  To use it, you need to remove the '%' on
% the beginning of the line, and insert your own information in the call.
%
%\tgrind[htbp]{code/pmn.s.tex}{Post Multiply Normalization}{opt:pmn}

As the example shows, the intermediate results can be multiplied
together with no need for intermediate normalizations, thanks to the
guard bits.  It is only at the end of the sequence that a normalization
must be performed, in order to get the result into a format suitable
for storing in memory\footnote{Note that for purposes of clarity, the
pipeline delays were considered to be 0, and the branches were not
delayed.}.

\subsection{Block Exponent}

In an unoptimized sequence of additions, the sequence of operations is
as follows for each pair of numbers $(m_1,e_1)$ and $(m_2,e_2)$:
\begin{enumerate}
\item Compare $e_1$ and $e_2$.
\item Shift the mantissa associated with the smaller exponent
$|e_1-e_2|$ places to the right.
\item Add $m_1$ and $m_2$.
\item Find the first one in the resulting mantissa.
\item Shift the resulting mantissa so that it is normalized.
\item Adjust the exponent accordingly.
\end{enumerate}
Out of these 6 steps, only one is the actual addition; the rest are
involved in aligning the mantissas prior to the add, and then
normalizing the result afterward (a C sketch of the unoptimized
sequence is given below).
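The following sketch (again illustrative only, reusing the simplified
{\tt ufloat} format from the earlier example and again ignoring signs
and rounding) writes out the six steps directly:

\begin{verbatim}
/* Unoptimized floating point add, following the six steps above. */
ufloat fadd(ufloat a, ufloat b)
{
    ufloat big = a, small = b;
    if (a.exp < b.exp) { big = b; small = a; }      /* 1: compare exponents */
    uint32_t d = (uint32_t)(big.exp - small.exp);
    small.mant = (d < 32) ? (small.mant >> d) : 0;  /* 2: align mantissas   */

    ufloat r = { 0, big.exp };
    uint64_t sum = (uint64_t)big.mant + small.mant; /* 3: add mantissas     */
    if (sum >> 32) { sum >>= 1; r.exp += 1; }       /* carry out of the top */
    r.mant = (uint32_t)sum;
    if (r.mant == 0) { r.exp = 0; return r; }       /* avoid looping on zero */

    while (!(r.mant & 0x80000000u)) {  /* 4: find the leading one...        */
        r.mant <<= 1;                  /* 5: ...and shift to normalize      */
        r.exp -= 1;                    /* 6: adjust the exponent            */
    }
    return r;
}
\end{verbatim}

Every addition in a chain repeats this alignment and normalization
work; the block exponent optimization removes exactly that repetition.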
In the block exponent optimization, the largest exponent is found
first, and all the mantissas are shifted before any additions take
place.  Once the mantissas have been shifted, the additions can take
place one after another\footnote{This requires that for $n$ consecutive
additions, there are $\log_{2}n$ high guard bits to prevent overflow.
In the $\mu$FPU, there are 3 guard bits, making up to 8 consecutive
additions possible.}.  An example of the block exponent optimization on
the expression $X = A + B + C$ is given in figure~\ref{opt:be}.

% This is an example of how you would use tgrind to include an example
% of source code; it is commented out in this template since the code
% example file does not exist.  To use it, you need to remove the '%' on
% the beginning of the line, and insert your own information in the call.
%
%\tgrind[htbp]{code/be.s.tex}{Block Exponent}{opt:be}

\section{Integer optimizations}

As well as the floating point optimizations described above, there are
also integer optimizations that can be used in the $\mu$FPU.  In
concert with the floating point optimizations, these can provide a
significant speedup.

\subsection{Conversion to fixed point}

Integer operations are much faster than floating point operations;
wherever it is possible to replace floating point operations with fixed
point ones, a significant increase in speed results.  This conversion
can either take place automatically, or be based on a specific request
from the programmer.  To do it automatically, the compiler must either
be very smart, or play fast and loose with the accuracy and precision
of the programmer's variables.  To be ``smart'', the compiler must
track the ranges of all the floating point variables through the
program, and then see if there are any potential candidates for
conversion to fixed point.  This technique is discussed further in
section~\ref{range-tracking}, where its implementation is described.

The other way to do this is to rely on specific hints from the
programmer that a certain value will only assume a specific range, and
that only a specific precision is desired.  This is somewhat more
taxing on the programmer, in that the ranges the values will take must
be known at declaration time (something normally abstracted away), but
it does provide the opportunity for fine-tuning already working code.

Potential applications of this would be simulation programs, where a
variable represents some physical quantity; the constraints of the
physical system may provide bounds on the range the variable can take.

\subsection{Small Constant Multiplications}

One other class of optimizations that can be done is to replace
multiplications by small integer constants with some combination of
additions and shifts, which can be significantly faster than a
multiplication.  This is done by using some combination of
\begin{eqnarray*}
a_i & = & a_j + a_k \\
a_i & = & 2a_j + a_k \\
a_i & = & 4a_j + a_k \\
a_i & = & 8a_j + a_k \\
a_i & = & a_j - a_k \\
a_i & = & a_j \ll m \quad \mbox{(left shift by $m$)}
\end{eqnarray*}
instead of the multiplication.  For example, to multiply $s$ by 10 and
store the result in $r$, you could use:
\begin{eqnarray*}
r & = & 4s + s \\
r & = & r + r
\end{eqnarray*}
Or by 59:
\begin{eqnarray*}
t & = & 2s + s \\
r & = & 2t + s \\
r & = & 8r + t
\end{eqnarray*}
Similar combinations can be found for almost all of the smaller
integers \cite{magenheimer:precision}\footnote{This is only an
``optimization'', of course, when the time spent on the shifts and adds
is less than the time that would be spent doing the multiplication.
Since the time costs of these operations are already known to the
compiler in order for it to do scheduling, it is easy for the compiler
to determine when this optimization is worth using.}.
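Written out in C (an illustration for this text, not compiler output;
overflow and the pedantry of shifting signed values are ignored for
clarity), the two decompositions above become:

\begin{verbatim}
#include <stdint.h>

/* Multiply by 10: r = 4s + s (= 5s), then r = r + r (= 10s). */
int32_t times10(int32_t s)
{
    int32_t r = (s << 2) + s;   /* r = 4s + s = 5s */
    return r + r;               /* 10s             */
}

/* Multiply by 59: t = 2s + s (= 3s), r = 2t + s (= 7s),
   then r = 8r + t (= 56s + 3s = 59s). */
int32_t times59(int32_t s)
{
    int32_t t = (s << 1) + s;   /* t = 3s          */
    int32_t r = (t << 1) + s;   /* r = 7s          */
    return (r << 3) + t;        /* 56s + 3s = 59s  */
}
\end{verbatim}

Each routine uses three or fewer shift-and-add steps, so it wins
whenever a multiply costs more than three simple ALU operations.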
\section{Other optimizations}

\subsection{Low-level parallelism}

The current trend is towards duplicating hardware at the lowest level
to provide parallelism\footnote{This can be seen in the i860: floating
point additions and multiplications can proceed at the same time, and
the RISC core can be moving data in and out of the floating point
registers and providing flow control while the floating point units are
active \cite{byte:i860}.}.  Conceptually, it is easy to take advantage
of low-level parallelism in the instruction stream by simply adding
more functional units to the $\mu$FPU, widening the instruction word to
control them, and then scheduling as many operations as possible to
take place at one time.

However, simply adding more functional units can only be done so many
times; there is only a limited amount of parallelism directly available
in the instruction stream, and without more of it, much of the extra
hardware will go to waste.  One technique used to make more
instructions potentially schedulable at any given time is ``trace
scheduling'', which originated in the Bulldog compiler for the original
VLIW machine, the ELI-512 \cite{ellis:bulldog,colwell:vliw}.  In trace
scheduling, code can be scheduled through many basic blocks at one
time, following a single potential ``trace'' of program execution.  In
this way, instructions that {\em might\/} be executed depending on a
conditional branch further down in the instruction stream are scheduled
early, allowing an increase in the potential parallelism.  To account
for the cases where the expected branch wasn't taken, correction code
is inserted after the branches to undo the effects of any prematurely
executed instructions (a schematic sketch is given at the end of this
section).

\subsection{Pipeline optimizations}

In addition to having operations going on in parallel across functional
units, it is also typical to have several operations in various stages
of completion in each unit.  This pipelining allows the throughput of
the functional units to be increased, with no increase in latency.

There are several ways pipelined operations can be optimized.  On the
hardware side, support can be added to allow data to be recirculated
back into the beginning of the pipeline from the end, saving a trip
through the registers.  On the software side, the compiler can utilize
several tricks to try to fill as many of the pipeline delay slots as
possible, as described by Gibbons \cite{gib86}.
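To make the trace scheduling idea concrete, the following sketch is a
hypothetical source-level illustration; a real trace scheduler works on
machine instructions, and all names here are invented.  It shows an
operation hoisted above the branch it originally followed:

\begin{verbatim}
/* Before: the multiply can only be scheduled inside its own block. */
int before(int cond, int a, int b, int c, int d)
{
    if (cond)
        return a * b + c;   /* the expected (on-trace) path */
    return d;               /* the rare (off-trace) path    */
}

/* After scheduling along the trace where cond is usually true:
   the multiply is hoisted above the branch, so it can overlap
   with earlier work in a wide instruction word. */
int after(int cond, int a, int b, int c, int d)
{
    int x = a * b;          /* speculatively executed          */
    if (cond)
        return x + c;       /* on-trace: uses the early result */
    /* Off-trace: x is simply dead here, so no correction code is
       needed; had the hoisted operation overwritten a live value,
       code to restore it would be inserted at this point. */
    return d;
}
\end{verbatim}

The win comes when the on-trace path is taken most of the time, so the
occasional wasted multiply costs less than the scheduling freedom
gained.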