KEYWORDS
Photoacoustics; LED Acoustic-X; Non-learning noise removal methods; U-Net; Signal-to-noise ratio; Peak signal-to-noise ratio; Contrast-to-noise ratio; Frequency spectrum analysis
1 | INTRODUCTION
Advances in laser and ultrasound technology have enabled the development of photoacoustic (PA) imaging, a concept first described by Bell [1], into a full-fledged non-invasive, non-ionizing and label-free [2] hybrid imaging modality. PA imaging, armed with good optical contrast and high spatial resolution, has potential for clinical use as it facilitates functional and physiological imaging at high penetration depth [3]. Photoacoustic signals are generated when the laser pulse width satisfies the thermal and stress confinement conditions; typical pulse widths are in the nanosecond range. Traditionally, tissues are irradiated with nanosecond pulsed laser illumination at fluences in the range of 20-35 mJ/cm2, at wavelengths pre-determined by the optical absorption properties of the sample [4-7]. A transducer captures the generated photoacoustic signal, which is then reconstructed to form an image representative of the optical absorption properties of the object. Conventionally, Nd:YAG-pumped optical parametric oscillator (OPO) lasers are used for photoacoustic imaging; these are often expensive and, until recently, were not mobile. Owing to their cost-effectiveness, portability and high pulse repetition frequency (PRF), light emitting diode (LED)-based illumination systems [8-16] have recently emerged as a compelling alternative for clinical and preclinical use. Despite these advantages, LED arrays cannot provide a fluence output comparable to that delivered by a laser. The low pulse energy (on the order of nanojoules) generated by the LED arrays is generally compensated for by high frame averaging, since the arrays can be operated at a high PRF of 1-4 kHz [17, 18]. However, increasing the number of averaged frames increases the acquisition time, sometimes to several seconds or minutes for a single 2D image. Understanding the functional dynamics of biological events in vivo requires real-time imaging capabilities and should not be limited by the time required for image acquisition. Therefore, there is a need to significantly improve the acquisition speed of PA imaging systems without compromising the signal-to-noise ratio (SNR).
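To make this trade-off concrete, the acquisition time for one averaged image is simply the frame count divided by the PRF. A quick sketch using the 4 kHz PRF and the 128- and 25,600-frame averages employed later in this study (Section 2.1):

```python
# Acquisition time per averaged image = number of frames / PRF.
prf_hz = 4000                           # pulse repetition frequency (4 kHz)
for n_frames in (128, 25_600):          # low and high frame averages
    t_sec = n_frames / prf_hz
    print(f"{n_frames:6d} frames: {t_sec:5.2f} s/image ({1 / t_sec:.2f} Hz)")
# -> 128 frames: 0.03 s/image (31.25 Hz); 25600 frames: 6.40 s/image (0.16 Hz)
```

These figures agree, to rounding, with the 30 Hz and 0.15 Hz effective rates reported in Section 2.1.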
Several recent developments, including advances in array transducers, light delivery systems and deep-learning strategies, have occurred in the field of PA imaging [19]. In the realm of machine learning in particular, many deep-learning network models are being investigated to improve PA image reconstruction methodologies [20-23]. Several important studies, including recent reviews [24-27] of deep learning with laser-based systems, addressed the problem of data sparsity arising from under-sampling with a limited number of detectors. While many studies demonstrated results on numerically simulated data and in vitro phantoms [28-35], only a handful tested their deep-learning models on in vivo data. Even where in vivo data was used, the training and testing data sets were drawn from similar in-class samples (for example, training and testing were both done on similar phantoms or in vivo vasculature datasets) [36-40]. This approach ultimately limits the generalizability of the deep network and carries the heavy burden of a large training-sample requirement [41, 42]. Notably, all the above studies concentrated on laser illumination, targeting limited-view, under-sampled data and bandwidth correction, and did not utilize LED-based illumination.
Currently, the main roadblock preventing LED-based PA imaging systems from making a clinical impact is their low signal-to-noise ratio due to low fluence. In this study, we address this roadblock with a deep-learning strategy: a U-Net architecture with data augmentation that is trained on few datasets, can be employed on spatially invariant out-of-class samples in real time, and is agnostic to varying noise levels. Specifically, we trained our U-Net architecture on input images (averages of 128 frames, i.e., low frame average images) and labels (averages of 25,600 frames, i.e., high frame average images) generated with the LED-based photoacoustic system Acoustic X.
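As an illustration of this input/label pairing, a minimal sketch of how such pairs could be formed from a stack of co-registered beamformed frames is given below; the array shapes, helper name and [0, 1] normalization are illustrative assumptions, not the study's exact preprocessing code.

```python
import numpy as np

def make_training_pair(frames: np.ndarray, n_low: int = 128):
    """Form a (noisy input, clean label) pair from a (N, H, W) frame stack.

    Illustrative helper: averaging the first 128 frames yields the low
    frame average input; averaging all N (here 25,600) frames yields the
    high frame average label.
    """
    low = frames[:n_low].mean(axis=0)    # low frame average -> network input
    high = frames.mean(axis=0)           # high frame average -> label

    def norm(im):                        # scale each image to [0, 1]
        return (im - im.min()) / (im.max() - im.min() + 1e-12)

    return norm(low), norm(high)
```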
We performed in vivo mouse subcutaneous tumor imaging with the Acoustic X system and quantitatively compared our U-Net architecture against traditional noise-reduction filters, namely median, Wiener, total variation (TV) and Savitzky-Golay (SG) filters, on various image metrics. Next, we compared the low frame average, high frame average and U-Net outputs in the frequency domain via their magnitude spectra, and identified the cause of the blurring in the U-Net outcomes. Finally, we performed a study to verify the network's invariance to the noise distribution type using Gaussian and salt-and-pepper (S&P) noise. Although the U-Net is unable to handle S&P noise, this is unlikely to pose a major hurdle in real-life clinics, as most electronic noise follows a Gaussian distribution, to which our U-Net is well adapted. In summary, the simple U-Net architecture can provide high-SNR photoacoustic images by compensating for the low fluence of LED-based systems, bringing PA imaging a step closer to clinical (bench-to-bedside) translation, even though the network is susceptible to brightness-contrast illusions and produces blurrier images than the ground truth.
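For reference, the two analyses mentioned above can be sketched compactly in Python: a centered log-magnitude spectrum for the frequency-domain comparison, and Gaussian and S&P corruption for the noise-invariance test. The noise levels, random seed and [0, 1] image scaling below are illustrative assumptions.

```python
import numpy as np

def magnitude_spectrum(image: np.ndarray) -> np.ndarray:
    """Centered log-magnitude spectrum (dB) for frequency-domain comparison."""
    spec = np.fft.fftshift(np.fft.fft2(image))
    return 20 * np.log10(np.abs(spec) + 1e-12)

def add_gaussian(image, sigma=0.05, rng=None):
    """Additive Gaussian noise; image assumed scaled to [0, 1]."""
    rng = rng or np.random.default_rng(0)
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

def add_salt_and_pepper(image, density=0.05, rng=None):
    """S&P noise: a fraction `density` of pixels is forced to 0 or 1."""
    rng = rng or np.random.default_rng(0)
    noisy = image.copy()
    u = rng.random(image.shape)
    noisy[u < density / 2] = 0.0        # pepper
    noisy[u > 1 - density / 2] = 1.0    # salt
    return noisy
```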
2 | EXPERIMENTAL METHODS
2.1 | Imaging System
The LED-based photoacoustic system Acoustic X, by Cyberdyne Inc., was used for capturing all the images reported in this study. The array transducer used for imaging had a central frequency of 7 MHz (128 elements, 80% bandwidth at −6 dB). The LED arrays, positioned at approximately 43° to the transducer, produced 850 nm irradiation with a 30 nanosecond pulse width. A PRF of 4 kHz was used throughout the study. The gain was set between 60 and 67 dB, depending on whether in vitro or in vivo samples were being imaged, and was kept constant within a given experiment. The dynamic range was set to 19 dB for the in vitro phantoms and 19-25 dB for the mouse tumor models, respectively. Images generated by averaging 128 frames (low frame average) were produced at 30 Hz, whereas images generated by averaging 25,600 frames (high frame average) were produced at 0.15 Hz. After capture from the LED system, the beamformed images were further processed with custom-written MATLAB (R2021b) code for cropping and resizing to 256 x 256 pixels. All image metrics were calculated in MATLAB.
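Although the metrics were computed in MATLAB, their definitions can be sketched in Python as follows. These are common ROI-based conventions for PSNR, SNR and CNR; the exact ROI selections and definitions used in the study's MATLAB code may differ.

```python
import numpy as np

def psnr(pred: np.ndarray, ref: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; images assumed scaled to [0, peak]."""
    mse = np.mean((pred - ref) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

def snr(img: np.ndarray, signal: np.ndarray, background: np.ndarray) -> float:
    """ROI-based SNR in dB; `signal`/`background` are boolean masks."""
    return 20 * np.log10(img[signal].mean() / img[background].std())

def cnr(img: np.ndarray, signal: np.ndarray, background: np.ndarray) -> float:
    """ROI-based contrast-to-noise ratio."""
    return (img[signal].mean() - img[background].mean()) / img[background].std()
```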
2.2 | In vitro tissue mimicking phantoms
Several cylindrical graphite-powder phantoms in a titanium dioxide (TiO2; Sigma Aldrich) scattering medium were prepared inside a 6-well plate (Falcon™ 353046, Fisher Scientific). TiO2 (0.1%) was first mixed with hot (43-45 °C) gelatine solution (8% w/v) and poured into the wells once the temperature reached around 35 °C, with a cylindrical tube positioned at the centre of each well to create a hole for the graphite inclusion. Portions of the TiO2-gelatine solution were mixed with graphite powder at concentrations of 0.01%, 0.025%, 0.05% and 0.1%, and poured into the holes created in the phantoms.
2.3 | In vivo murine tumor models
Mice bearing subcutaneous xenograft tumors were used to evaluate our algorithm. Human pancreatic adenocarcinoma cells (AsPC-1) were suspended in a 1:1 (v/v) mixture of Matrigel (BD Bioscience) and phosphate-buffered saline at a density of 5 million cells per 100 μL and injected subcutaneously into nude mice. At around 55-60 days post-inoculation, the tumors had grown to approximately 300-400 mm3 in volume, with a heterogeneous tumor microenvironment containing both vascular and avascular regions. At the time of imaging, the mice were anesthetized with 2% isoflurane and rested on a custom-made platform in a water bath with the head raised above the water level. During imaging, the isoflurane was reduced to 1-1.5% to maintain general anaesthesia. A total of 8 mice were imaged in this study, with three imaging planes approximately 4-5 mm apart acquired per mouse. All procedures were performed in compliance with the guidelines of the Institutional Animal Care and Use Committee of Tufts University.
2.4 | Traditional denoising algorithms
Our network’s performance was compared with several established non-learning denoising filters: the low frame average images were input to the Wiener [46], median [47], TV [48] and SG [49] filters. For the Wiener filter, we used pixel-wise adaptive filtering with a 5 x 5 neighborhood window for the local mean and standard deviation estimation; the additive noise was estimated and passed into the filtering via the MATLAB “wiener2” command. The median filter was applied via the MATLAB built-in “medfilt2” command. A primal-dual algorithm was utilized for TV-L1 denoising, with the regularization parameter (λ) fixed at 5 and the number of iterations set to 200. The parameter values were set by trial and error; a lower value of λ denotes more aggressive denoising, which can produce an over-smoothed image. The SG filtering was implemented in MATLAB with the built-in “sgolayfilt” command, with the polynomial order set to 3 (cubic) and the frame length set to 15.
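For readers working outside MATLAB, a rough Python sketch of these baselines is given below using SciPy and scikit-image. The window sizes follow the text where stated; the median kernel size and the TV weight are illustrative, and denoise_tv_chambolle (a TV-L2 method) stands in here for the primal-dual TV-L1 solver used in the study.

```python
import numpy as np
from scipy.signal import wiener, medfilt2d, savgol_filter
from skimage.restoration import denoise_tv_chambolle

def classical_denoisers(img: np.ndarray) -> dict:
    """Apply the four baseline filters to a 2D image scaled to [0, 1]."""
    img = img.astype(np.float64)
    return {
        # pixel-wise adaptive Wiener filtering, 5 x 5 neighborhood (cf. wiener2)
        "wiener": wiener(img, mysize=(5, 5)),
        # 2D median filter (cf. medfilt2); the 3 x 3 kernel is an assumption
        "median": medfilt2d(img, kernel_size=3),
        # TV denoising; Chambolle's TV-L2 as a stand-in for primal-dual TV-L1
        "tv": denoise_tv_chambolle(img, weight=0.1),
        # cubic Savitzky-Golay smoothing, frame length 15, along each column
        "sg": savgol_filter(img, window_length=15, polyorder=3, axis=0),
    }
```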
2.5 | Deep learning architecture
Following the work on biomedical image segmentation [50], we built a U-Net deep learning network to predict high frame average images from low frame average ones. The U-Net code (main model, training, data augmentation and testing) was written in Keras, a deep learning API that runs on top of the end-to-end open-source platform TensorFlow. The Google Colab platform with a Python3 Google compute engine backend GPU (Tesla K80, CUDA version 11.2, Intel(R) Xeon(R) CPU @ 2.00 GHz) was used for this purpose. The fully convolutional U-Net comprises an encoding part and a decoding part; the information lost during down-conversion of the image flows through skip connections (yellow arrows) to the corresponding up-converted layers, as shown in Fig. 1, and this concatenation of feature maps provides localization information. The input image size was set to 256 x 256 pixels. Each layer of the network has two multichannel feature maps (deep blue boxes in Fig. 1) with varying numbers of feature channels (starting from 64). Convolutions were applied with 3 x 3 kernels (green arrows) and ‘same’ padding, so that the corners of the image are not missed, followed by a rectified linear unit (ReLU) activation function. After the convolution and non-linearity steps, max-pooling was performed with a 2 x 2 stride (orange down arrows), halving the pixel dimensions, and the same convolution and pooling operations were repeated three more times. From the bottommost layer, sometimes known as the bottleneck layer, up-sampling (black up arrows) is executed with the same stride as the max-pooling, and the convolution operations mirror those of the contracting pathway at the corresponding hierarchy; the only extra operation is the concatenation, denoted by the light blue boxes in Fig. 1. At the end of the contracting path there are two dropout operations, denoted by the brown arrows. The last convolution layer is passed through a sigmoid activation function, which provides the predicted denoised outputs. Mean squared error (MSE) was used as the loss function, and the Adam optimizer was applied with an initial learning rate of 1e-4.
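A minimal Keras sketch consistent with this description is shown below. The doubling filter progression beyond 64, the 0.5 dropout rate, and the 2 x 2 convolution after each up-sampling step are assumptions where the text does not pin them down.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    """Two 3 x 3 'same'-padded convolutions with ReLU activations."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(256, 256, 1)):
    inputs = layers.Input(shape=input_shape)

    # Contracting path: conv blocks followed by 2 x 2 max-pooling.
    skips, x = [], inputs
    for filters in (64, 128, 256):
        x = conv_block(x, filters)
        skips.append(x)                      # saved for the skip connection
        x = layers.MaxPooling2D(2)(x)

    # Two dropout operations at the end of the contracting path.
    x = conv_block(x, 512)
    x = layers.Dropout(0.5)(x)
    skips.append(x)
    x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, 1024)                  # bottleneck layer
    x = layers.Dropout(0.5)(x)

    # Expanding path: up-sample, concatenate the skip, then convolve.
    for filters, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.UpSampling2D(2)(x)
        x = layers.Conv2D(filters, 2, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, skip])  # localization information
        x = conv_block(x, filters)

    # 1 x 1 convolution with sigmoid gives the denoised prediction.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="mse")                # MSE loss, Adam, initial LR 1e-4
    return model
```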