Parallel implementations of a TV-\(L^{1}\) image-denoising algorithm

I describe parallel implementations of a total variational image-denoising algorithm. Specifically, the primal-dual formulation of the TV-\(L^{1}\) model is implemented in both the shared memory and distributed memory paradigms using the Open Multi-Processing (OpenMP) API and the Message Passing Interface (MPI), respectively. Experiments on the Stampede supercomputer show good weak and strong scaling performance. Weak scaling performance suggests that the distributed implementation may be particularly suited to very large problems where the overhead associated with passing and receiving messages is much less than the local work.

Introduction

Imaging acquisition involves many hardware and software stages that introduce error sources. This is seen as visual artifacts in the image, typically recognized as noise. This can be especially noticeable in images acquired with low levels of illumination, such as night photography.

Two common manifestations of noise in digital images are Gaussian noise and salt-and-pepper noise. Gaussian noise is typically associated with errors in detection. It produces pixel values that vary within a quasi-normally distributed range about the ”true” value at that point in the image. Salt-and-pepper noise typically arises from transmission errors, and the pixel value is recorded as either fully on or fully off (in grayscale, white or black). Removing these artifacts can be desirable from an aesthetic perspective or in order to pre-process images for other workflows. Common techniques to address this include Gaussian blurring, which takes the convolution of an image with a Gaussian kernel window, and median filtering, which replaces pixels with the median value of a sliding window. These techniques can reduce noise; however, they are also susceptible to blurring edges of features.

The total variation technique was introduced in 1992 by Rudin, Osher and Fatemi (ROF) (Rudin 1992) as an alternative denoising method. The method works by iteratively constructing a function \(u\) on a domain \(\Omega\) that minimizes its energy with an input function \(f\):

\begin{equation} \min_{u}\int_{\Omega}\left\|{\nabla{u}}\right\|+\lambda\int_{\Omega}(u-f)^{2}\\ \end{equation}

with \(\|\cdot\|\) being the \(L^{2}\) norm. The first term is the total variation of the image. It is a regularizer for the minimization function. The second term also includes an \(L^{2}\) norm, which means that the problem is convex and has a unique solution (Chambolle 2009).

Replacing the second term above with an \(L^{1}\) norm leads to the TV-\(L^{1}\) model:

\begin{equation} \min_{u}\int_{\Omega}\left\|{\nabla{u}}\right\|+\lambda\int_{\Omega}\left|{u-f}\right|\\ \end{equation}

The TV-\(L^{1}\) model is not a (necessarily) convex model; however, when applied to discrete images, the model tends to offer better performance at removing salt-and-pepper noise from images than the ROF model and is thus an important image processing technique. It is a standard method implemented in many computer vision packages, including OpenCV (opencv: {Open} {Sourc...).

Algorithm

For discrete images in a TV-\(L^{1}\) model, \(\Omega\subset\mathbb{R}^{2}\) and the integrals become sums over the pixels. The discretization of the TV-\(L^{1}\) problem is well suited for parallelization. At all stages of the calculation, only immediate neighbors are needed. Each stage of the calculation is also completely inside of a loop over each pixel, making the shared memory OpenMP implementation particularly straightforward.

A pseudocode representation of the algorithm is given below. The interpretation of the parameters are not included here; the function descriptions are provided to illustrate the algorithm dependencies. For full details of the parameters, see (Chambolle 2010).

{algorithm}

\label{euclid}Primal-dual TV-\(L^{1}\) denoising (after Chambolle and Pock) {algorithmic}[1] \ProcedureTVL1 \State\(X_{0}\leftarrow image\) \State\((px,py)\leftarrow\nabla(X_{0})\) \For\(iter:=0,maxiter\) \ForAll\((i,j)\in X_{iter}\): \State\((dx,dy)\leftarrow\nabla(X_{iter})\) \State\((px,py)\leftarrow\textbf{{project}}(px+\sigma dx,py+\sigma dy)\) \State\(nablat\leftarrow\nabla^{T}(px,py)\) \State\(X^{\prime}\leftarrow\textbf{{shrink}}(X_{iter}-\tau nablat,\lambda\tau)\) \State\(X_{iter+1}\leftarrow X_{iter}+(X_{iter}-X^{\prime})\) \EndFor\EndFor\Return\(X_{maxiter}\)\CommentThe final update. \EndProcedure

The calculation of the gradient for an image \(u\) on \(i\) by \(j\) pixels is given by

\begin{equation} \nabla{u_{i,j}}=\left(\begin{matrix}dx_{i,j}\\ dy_{i,j}\end{matrix}\right)=\left(\begin{matrix}u_{i+1,j}-u_{i,j}\\ u_{i,j+1}-u_{i,j}\end{matrix}\right)\\ \end{equation}

The project function takes the partial gradients \(dx\) and \(dy\) and projects them into a unit circle: If the length of the local gradient is larger than unity, the gradient is clipped by the unit circle; otherwise it is unchanged.

\begin{aligned} \nonumber \\ \textbf{anorm}_{i,j} & =\sqrt{dx^{2}_{i,j}+dy^{2}_{i,j}} \\ \textbf{project}\left(\begin{matrix}dx_{i,j}\\ dy_{i,j}\end{matrix}\right) & =\begin{cases}\left(\begin{matrix}dx_{i,j}\\ dy_{i,j}\end{matrix}\right)&\textbf{anorm}_{i,j}\leq 1\\ \frac{1}{\textbf{anorm}_{i,j}}\left(\begin{matrix}dx_{i,j}\\ dy_{i,j}\end{matrix}\right)&\textrm{otherwise}\end{cases}\\ \end{aligned}

The reverse transformation from projected gradients to pixel values is constructed directly from the gradient definition above.

\begin{equation} \nabla^{T}\left(\begin{matrix}dx_{i,j}\\ dy_{i,j}\end{matrix}\right)=dx_{i-1,j}-dx_{i,j}+dy_{i,j-1}-dy_{i,j}\\ \end{equation}

Finally, the shrink function clips the allowable difference of the transformed image from the \(iter\)-th solution \(X_{iter}\). The solution is then updated from this clipped value.

Implementation and Methodology

I wrote the implementations in C, following the general scheme above and using the nomenclature (also used in the function descriptions above) from a Python implementation written by Alexander Mordvintsev (Mordvintsev 2013). Images are passed into the routine and the results are written out as PPM files using \({\tt ppma\_io.c}\) by John Burkardt (Burkardt 1998). These reads and writes from disk are not included in the timing of the codes, but writing the results out is useful for validating that the filtering operation performs as expected.

When implemented using MPI, it is necessary to buffer the data in a way that allows for border values to be sent between processors. The general scheme for this buffering is illustrated in Figure \ref{fig:CellLayout}. Only the gradient operation and its inverse require communication steps.

The results of two MPI implementations are presented here. The first uses non-blocking communication to send and receive messages, and it initializes (with zero) the entries of the output arrays. The second also uses non-blocking communications, but does more interleaving of operations by performing all calculations in the array interior before processing the receive messages and calculating the array borders.

\label{fig:CellLayout}Arrangement of data for subproblems in a distributed memory implementation. To calculate the gradients \(dx\) and \(dy\), ghosts are needed below and to the right of the subproblem borders. For the reverse transformation, ghosts are needed above and to the left of the subproblem borders.

The indexing of the matrices is done in one dimension, and the location in the matrix is calculated based on the data held in the matrix. Gradients and their projections must account for the ghost cells in the first column and first row, and images and their transformations must account for the ghost cells in the last column and last row.

To test the performance of the parallel implementations, it is necessary to measure both weak scaling and strong scaling. For weak scaling, I chose two different work sizes per processor to identify any noticeable difference in scaling with varying computation to communication ratio. The first work size is \(64\times 64\) pixels per processor. The source image is derived from the \(512\times 512\) ”ascent” public domain reference image included in the Python SciPy package (Jones 2001--). Although not strictly necessary to test the performance of the algorithm, both Gaussian and salt-and-pepper noise were added to the reference image so that the accuracy of the implementation could be compared against (Mordvintsev 2013). The image is truncated in each dimension by multiples of 64 pixels, and the test is performed on \(\{i^{2}|i\in\mathbb{Z}^{+},i\leq 8\}\) processors for the MPI implementations and \(\{i^{2}|i\in\mathbb{Z}^{+},i\leq 4\}\) for the OpenMP implementation. This latter restriction is necessitated by the limitation on the number of processes per node.

\label{fig:NoisySteps}\(512\times 512\) pixel source image used for the ”small” (\(64\times 64\) pixel per processor) weak scaling test.

\label{fig:cleansteps}The result after 100 iterations of the the TV-\(L^{1}\) denoising filter applied to Figure \ref{fig:NoisySteps}. This output comes from the OpenMP implementation tested in this report.

\label{fig:uo}\(1530\times 2444\) pixel source image used for the strong scaling test and ”large” (\(190\times 190\) pixel per processor) weak scaling test.

For the ”large” weak scaling test, the work size is \(190\times 190\) pixels per processor. The source image for this test comes from the Urban Observatory (UO), run by the NYU Center for Urban Science and Progress. The image is sized \(1530\times 2444\) pixels with three color channels. This nighttime image shows a large amount of noise, especially in the red channel. It is converted to a single-channel grayscale image by taking the mean across the three channels. The image is then truncated to \(8\times 190=1520\) pixels per side, and then truncated again in each dimension by multiples of 190 pixels for each of the same processor counts as in the small weak scaling test.

For the strong scaling test, the full size \(1530\times 2444\) pixel grayscale image is used, and each implementation is tested on the same numbers of processors as in the weak scaling tests. One caveat of the strong scaling test performed here is that for the MPI implementation, if the number of processors assigned to each side of the image does not evenly divide the number of pixels per side of the image, then the additional rows and columns are truncated. For this image and the sizes tested, there are at most four rows or columns truncated, so the reduction in work size is minimal. No truncation is performed for the shared memory OpenMP implementation.

The tests were run on the Stampede supercomputer at the University of Texas. Stampede is a cluster featuring 16 processors (eight cores per each of two Xeon E5-2680 Sandy Bridge CPUs) and 32 GB DDR3 RAM per node, connected with a high-speed InfiniBand network. For tests on more than 16 processors, a job for four nodes was submitted, and for 16 or fewer processors, a separate job requesting a single node was submitted. Each test was performed with 100 iterations of the algorithm. Larger scale results are not reported here as the time to execute production queue jobs was reported as nearly 10 days when they were submitted. However, this document will be updated when the results are available.

All code was compiled on Stampede using \({\tt gcc}\) version 4.4.7 for OpenMP code and \({\tt mpicc}\) version 15.0.2 for MPI code. The following compiler flags were used for all timed test runs: \(<code>-lm -lrt -O3 -std=gnu99</code>\). The \(<code> -fopenmp</code>\) flag was also added to the OpenMP compilations. The MPI code was compiled against the MVAPICH2 library version 2.1, which is optimized to run on the high-performance network that undergirds Stampede.

Results

The weak scaling results for the small and large size tests are shown in Figures \ref{fig:weaksmall} and \ref{fig:weaklarge}. For the weak scaling results, the OpenMP implementation performs better than simple linear scaling of the single processor performance; however, it is not as quick as either of the MPI implementations. For the small work size, there is an easily seen communication penalty when the number of processors necessitates the use of more than one node. Once that communication penalty is incurred, though, the slope of the scaling is similar to the slope of the scaling on a single node.

\label{fig:weaksmall}Weak scaling results for the small work size. The x-axis indicates the number of processors used, and the y-axis indicates the time in seconds per iteration of the computation. The MPI Interleaved data points lie on top of the MPI data, so the latter are not visible here.

\label{fig:weaklarge}Weak scaling results for the large work size. The x-axis indicates the number of processors used, and the y-axis indicates the time in seconds per iteration of the computation.

The large weak scaling test shows a similar gap in performance between the OpenMP and MPI implementations. The large penalty seen in the small test due to internode communication is largely eliminated. This, and the lack of any noticeable improvement in timings between the merely initialized and fully interleaved variations of the MPI code, indicates that keeping the processors busy with larger subproblem sizes can effectively hide the latency of the network messages.

Future work on this project will include reporting on larger problem sizes and processor counts. A pure non-blocking implementation was also written, but not tested, and may provide interesting results given the latency result noted above. A hybrid OpenMP + MPI implementation could also be investigated to explore the gap between the performance of the shared and distributed memory versions.

All code for this project is available in a public repository at (Prince 2017).

\label{fig:strong}Strong scaling results for the UO test image. The x-axis indicates the number of processors used, and the y-axis (plotted logarithmically) indicates the time in seconds per iteration of the computation.

Acknowledgements

I thank Georg Stadler and Bill Bao for their initial feedback on my project proposal and for literature recommendations. I also thank Federica Bianco for supplying images from the CUSP Urban Observatory for use in this paper.

References

  1. Leonid I. Rudin, Stanley Osher, Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena 60, 259–268 Elsevier BV, 1992. Link

  2. Antonin Chambolle, Vicent Caselles, Matteo Novaga, Daniel Cremers, Thomas Pock. An introduction to Total Variation for Image Analysis. (2009). Link

  3. opencv: Open Source Computer Vision Library. OpenCV, 2017. Link

  4. Antonin Chambolle, Thomas Pock. A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging. Journal of Mathematical Imaging and Vision 40, 120–145 Springer Nature, 2010. Link

  5. Alexander Mordvintsev. ROF and TV-L1 denoising with Primal-Dual algorithm. (2013). Link

  6. John Burkardt. ppma_io.c. (1998). Link

  7. Eric Jones, Travis Oliphant, Pearu Peterson, others. SciPy: Open source scientific tools for Python. (2001–). Link

  8. Chris Prince. hpc17: Coursework for NYU HPC class. (2017). Link

[Someone else is editing this]

You are editing this file