Abstract

In this paper we use radix sort to demonstrate certain factors that affect the performance of parallel approaches. We develop parallel code for radix sort that uses shared memory and coalesced memory transfers to achieve considerable speedup. We then alter the conditions under which different factors influence the performance of the algorithm, analyse the extent of each factor's impact, and show how this knowledge can be used to write better algorithms. Even though software tools exist to expose bottlenecks, it is important to analyse one algorithm in depth so that it can act as a guide for others, thus allowing us to develop better algorithms.

Introduction

Today many algorithms are being parallelised to exploit the speedup provided by accelerators such as GPUs. Languages like CUDA make this process much simpler by giving low-level control over instructions, enabling programmers to produce code efficient enough for particular solutions. The advantage of using GPUs for general problems, widely known as general-purpose GPU (GPGPU) programming, is the ability to harness heavy computational power to speed up code, not to mention the availability of high-bandwidth data transfer and shared memory, which improve performance by large factors. Coming to our topic, we use radix sort to analyse the factors that affect the performance of parallel algorithms; radix sort was chosen because it is a highly parallel algorithm and also allows the use of shared memory and coalesced memory access patterns.

The main points this paper deals with are: the development of a parallel radix sort that uses shared memory and coalesced memory access, the factors that affect the performance of parallel algorithms, and an analysis of experimental data showing the extent of each factor's impact.


Radix Sort

Radix sort is a popular non-comparative sorting algorithm that is inherently parallel, which enables us to develop an efficient parallel implementation. In this paper we develop parallel code for radix sort based on the most significant bit (MSB). Let us first briefly recall how radix sort works: instead of comparing keys, it repeatedly partitions them by individual digits, here bits. In the MSB variant, the keys are first split into two buckets according to their most significant bit, with all 0-bit keys placed before all 1-bit keys; each bucket is then split again on the next bit, and so on down to the least significant bit, at which point the sequence is fully sorted. For example, sorting {5, 2, 7, 1} (binary 101, 010, 111, 001) first splits on bit 2 into {2, 1} and {5, 7}, and further splits on bits 1 and 0 yield the sorted order {1, 2, 5, 7}.
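
To make this concrete, here is a minimal sequential reference, a sketch rather than our parallel implementation, written as CUDA C++ host code: a recursive binary MSB radix sort that partitions the keys on one bit (0-bit keys to the front, 1-bit keys to the back) and recurses on each half with the next lower bit.

    #include <cstdio>

    // CPU reference: binary MSB radix sort. Partition a[lo..hi) on `bit`,
    // then recurse on each bucket with the next lower bit.
    void msb_radix_sort(unsigned int *a, int lo, int hi, int bit) {
        if (hi - lo <= 1 || bit < 0) return;
        int i = lo, j = hi - 1;
        while (i <= j) {
            if ((a[i] >> bit) & 1u) {               // bit set: send to the back
                unsigned int t = a[i]; a[i] = a[j]; a[j] = t;
                --j;
            } else {
                ++i;                                 // bit clear: already in place
            }
        }
        msb_radix_sort(a, lo, i, bit - 1);           // 0-bucket
        msb_radix_sort(a, i, hi, bit - 1);           // 1-bucket
    }

    int main() {
        unsigned int a[] = {5, 2, 7, 1, 6, 3};
        msb_radix_sort(a, 0, 6, 2);                  // 3 bits cover values < 8
        for (unsigned int v : a) printf("%u ", v);   // prints: 1 2 3 5 6 7
        printf("\n");
        return 0;
    }

Each level of this recursion corresponds to one pass of the parallel version developed below.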

One can go through other similar works to get a clear idea of parallel radix sort algorithms. In our implementation of radix sort we need to remember two important points: first, blocks execute in an arbitrary order, so the parallel steps must be independent of the order of block execution; second, shared memory and coalesced memory access patterns should be used to improve the performance of the algorithm.
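
As a minimal illustration of the second point, the two copy kernels below (illustrative names, not part of our sorter) differ only in their index computation, yet the first lets each warp read consecutive addresses that the hardware can serve with a few wide transactions, while the second scatters each warp's accesses across many transactions.

    // Coalesced: consecutive threads read consecutive addresses.
    __global__ void copy_coalesced(const int *in, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: neighbouring threads read addresses `stride` elements apart,
    // so each warp's accesses are spread over many memory transactions.
    __global__ void copy_strided(const int *in, int *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }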

Looking deeper into the algorithm, we can divide each pass into several parts. Since checking the relevant bit of each element is independent of the others, one thread can be assigned per element. The local position of each element within its block can then be obtained by a prefix sum (scan) over the bit flags. Finally, the elements are transferred to global memory so that they land in their correct positions. This completes one pass, and the pass is repeated n times, where n is the number of bits required to represent all the numbers being sorted.
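
The sketch below shows how these parts fit together within a single block, assuming one key per thread and an illustrative block size of 256: a per-thread bit check, a shared-memory prefix sum over the 0-flags, and a stable scatter that places all 0-keys before all 1-keys. The cross-block prefix sum and global scatter needed for a complete pass are omitted here.

    #define B 256  // threads per block; one key per thread in this sketch

    // Split B keys on `bit` inside one block: 0-keys first, then 1-keys,
    // each group keeping its original order (a stable split).
    __global__ void split_on_bit(unsigned int *keys, int bit) {
        __shared__ unsigned int s_keys[B];
        __shared__ unsigned int s_scan[B];   // scan of the "bit is 0" flags

        int t = threadIdx.x;
        unsigned int k = keys[blockIdx.x * B + t];        // coalesced read
        unsigned int is_zero = ((k >> bit) & 1u) ? 0u : 1u;
        s_scan[t] = is_zero;
        __syncthreads();

        // Inclusive Hillis-Steele scan: s_scan[t] = #0-keys in positions [0..t].
        for (int d = 1; d < B; d <<= 1) {
            unsigned int v = (t >= d) ? s_scan[t - d] : 0u;
            __syncthreads();
            s_scan[t] += v;
            __syncthreads();
        }

        unsigned int total_zeros = s_scan[B - 1];
        // Rank of a 0-key among the 0s, or of a 1-key among the 1s (after the 0s).
        unsigned int pos = is_zero ? s_scan[t] - 1u
                                   : total_zeros + (t - s_scan[t]);
        s_keys[pos] = k;
        __syncthreads();
        keys[blockIdx.x * B + t] = s_keys[t];             // coalesced write
    }

Launched as split_on_bit<<<numBlocks, B>>>(d_keys, bit), this reorders each block's keys locally; a full pass additionally feeds each block's zero count into a global prefix sum that determines where the block's buckets start in global memory.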

Factors 

Having detailed how the algorithm works, let us now analyse the factors that affect its performance. The important factors we study, which we justify using the experimental data in Section 4 (Experimental Results), are: the number of threads per block and hence the number of active blocks, the amount of shared memory per block, the use of shared memory versus global memory alone, and the size of the input data.
We will alter each of these factors in turn to analyse its impact on performance and thus justify its effect on our algorithm, and finally we will try to improve our algorithm with whatever we find beneficial, depending on the results we get.

Discussion

We have listed the factors to be studied; now, using the experimental data detailed in Section 4, we analyse their impact. As observed in Figure 1, changing the number of threads per block has a considerable impact on performance. One might expect that using a large number of threads always allows more parallel work. However, the limit on the number of active blocks per multiprocessor reduces how many blocks actually execute in parallel. Even though this may seem obvious, we decided to vary the number of threads per block and check whether the number of active blocks follows the expected pattern; doing so reveals the relation between active blocks and the performance of the algorithm.
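
This relation can be queried directly from the CUDA runtime. The snippet below is a sketch that uses cudaOccupancyMaxActiveBlocksPerMultiprocessor with a stub kernel (radix_pass_stub stands in for the real sorting kernel) to print how many blocks, and hence how many threads, can be resident on one multiprocessor at each block size.

    #include <cstdio>

    __global__ void radix_pass_stub() { }  // stand-in; pass the real kernel here

    int main() {
        for (int threads = 64; threads <= 1024; threads *= 2) {
            int active_blocks = 0;
            // How many blocks of this kernel fit on one SM at this block size?
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(
                &active_blocks, radix_pass_stub, threads, 0 /* dynamic smem */);
            printf("%4d threads/block -> %d active blocks/SM (%5d resident threads)\n",
                   threads, active_blocks, active_blocks * threads);
        }
        return 0;
    }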

From Figure 2 we can see the changes caused by varying the amount of shared memory per block; to our surprise, there is not much change in performance, which needs to be investigated further.
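
For reference, the experiment can be driven as in the sketch below, where pass_with_smem is an illustrative stand-in for our kernel: the same launch is repeated while the dynamic shared-memory reservation per block doubles, and each configuration is timed with CUDA events. Past some reservation size, fewer blocks fit on a multiprocessor and occupancy drops even though the kernel's work is unchanged.

    #include <cstdio>

    // Touches dynamic shared memory so the reservation is actually used.
    __global__ void pass_with_smem(unsigned int *keys, int n) {
        extern __shared__ unsigned int s_buf[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s_buf[threadIdx.x] = (i < n) ? keys[i] : 0u;
        __syncthreads();
        if (i < n) keys[i] = s_buf[threadIdx.x] + 1;
    }

    int main() {
        const int n = 1 << 20, threads = 256, blocks = n / threads;
        unsigned int *d_keys;
        cudaMalloc(&d_keys, n * sizeof(unsigned int));
        for (size_t smem = threads * sizeof(unsigned int); smem <= 32 * 1024; smem *= 2) {
            cudaEvent_t start, stop;
            cudaEventCreate(&start); cudaEventCreate(&stop);
            cudaEventRecord(start);
            pass_with_smem<<<blocks, threads, smem>>>(d_keys, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%6zu B shared/block: %.3f ms\n", smem, ms);
            cudaEventDestroy(start); cudaEventDestroy(stop);
        }
        cudaFree(d_keys);
        return 0;
    }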

We also compare the impact of shared memory by removing it from the implementation and using only global memory instead. Carefully detailing the effects of shared memory will help in using it effectively, weighed against the development and optimisation time it costs.
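
As a sketch of this comparison, consider the digit histogram that a radix pass builds (kernel names here are illustrative): the global variant sends every increment straight to global memory with an atomic, while the shared variant accumulates a private histogram per block and merges it into global memory once.

    // Global-memory variant: one global atomic per key.
    __global__ void hist_global(const unsigned int *keys, int n,
                                unsigned int *hist, int shift, int mask) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(&hist[(keys[i] >> shift) & mask], 1u);
    }

    // Shared-memory variant: per-block private histogram, merged once,
    // which greatly reduces contention on global memory.
    __global__ void hist_shared(const unsigned int *keys, int n,
                                unsigned int *hist, int shift, int mask) {
        extern __shared__ unsigned int s_hist[];
        for (int d = threadIdx.x; d <= mask; d += blockDim.x) s_hist[d] = 0u;
        __syncthreads();
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(&s_hist[(keys[i] >> shift) & mask], 1u);
        __syncthreads();
        for (int d = threadIdx.x; d <= mask; d += blockDim.x)
            atomicAdd(&hist[d], s_hist[d]);
    }

The shared variant is launched with (mask + 1) * sizeof(unsigned int) bytes of dynamic shared memory; timing the two kernels on the same input isolates the contribution of shared memory.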

Coming to the next point, the size of the input data in turn leads to a larger number of blocks being issued and more iterations, resulting in differences in execution time. However, this effect may vary with the complexity of the algorithm.
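
A small back-of-the-envelope sketch (threads = 256 and 32-bit keys are illustrative assumptions) shows how the launch geometry scales: the grid grows linearly with the data size, and the number of passes grows with the key width, so the total work scales with both.

    #include <cstdio>

    int main() {
        const int threads = 256, key_bits = 32;  // one pass per bit in a binary scheme
        for (int n = 1 << 16; n <= 1 << 24; n <<= 4) {
            int blocks = (n + threads - 1) / threads;   // one key per thread per pass
            printf("n = %8d -> %6d blocks x %d passes\n", n, blocks, key_bits);
        }
        return 0;
    }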

Experimental Results

[Figure 1: performance as the number of threads per block is varied. Figure 2: performance as the amount of shared memory per block is varied.]


Conclusion

We have discussed the impact of several factors on parallel algorithms. Such detailed analysis may help in developing better algorithms, and may also lead to a better understanding of how to analyse new algorithms.


Future Work

We intend to analyse different algorithms while varying various parameters, thereby enabling a deeper understanding of parallel algorithms. One point in this paper, the influence of shared memory on execution time, remains puzzling in its behaviour; even though we have tried various possibilities to understand the phenomenon, more light may yet be shed on it, which we will try to tackle in our next project. It also remains open to readers to do some research in that area, possibly from another perspective.