KEYWORDS
Photoacoustics; LED Acoustic-X; Non-learning noise removal methods;
U-Net; Signal-to-noise ratio; Peak signal-to-noise ratio;
Contrast-to-noise ratio; Frequency spectrum analysis
1 | INTRODUCTION
Advances in laser and ultrasound technology have enabled the development
of photoacoustic (PA) imaging, a concept first described by Bell [1], into a
full-fledged non-invasive, non-ionizing and label-free [2] hybrid
imaging modality. PA imaging, armed with good optical contrast and high
spatial resolution, has potential for clinical use as it facilitates
functional and physiological imaging at high penetration depth [3].
Photoacoustic signals are generated when the laser pulse width satisfies
the thermal and stress confinement conditions, and typical pulse widths
are therefore in the nanosecond range. Traditionally, tissues are
irradiated with nanosecond pulsed laser illumination at fluences in the
range of 20-35 mJ/cm² at pre-determined wavelengths chosen
based on the optical absorption properties of the sample [4-7]. A
transducer captures the generated photoacoustic signal that is then
reconstructed to form an image representative of the optical absorption
properties of the object. Traditionally, Nd:YAG-pumped optical
parametric oscillator (OPO) lasers are used for photoacoustic imaging.
These are often expensive and, until recently, were not portable. Owing
to their cost-effectiveness, portability, and high pulse repetition
frequency (PRF), light-emitting diode (LED)-based illumination
systems [8-16] have recently emerged as a compelling solution for
clinical and preclinical use. Despite these advantages, LED arrays
cannot provide a fluence output comparable to that delivered by a laser.
The low pulse energy (on the order of nanojoules) generated by the LED
arrays is generally compensated for by heavy frame averaging, as the
arrays can be operated at a high PRF of 1-4 kHz
[17, 18]. Increasing the number of frames will lead to an increase
in acquisition time, sometimes on the order of several seconds to
minutes for obtaining a 2D image. Understanding the functional dynamics
of biological events in vivo requires real-time imaging
capabilities and should not be limited by the time required for the
image acquisition process. Therefore, there is a need to significantly
improve the acquisition speed in PA imaging systems without compromising
the signal-to-noise ratio (SNR).
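As a back-of-the-envelope illustration (our own arithmetic, assuming the noise across frames is uncorrelated, so that averaging improves SNR by the square root of the number of frames):

```latex
% SNR gain and acquisition-time cost of averaging N frames at a given PRF
\mathrm{SNR}_{N} \approx \sqrt{N}\,\mathrm{SNR}_{1},
\qquad
t_{\mathrm{acq}} = \frac{N}{\mathrm{PRF}}
```

For example, at a PRF of 4 kHz, averaging 128 frames takes 128/4,000 ≈ 0.03 s per 2D image, whereas averaging 25,600 frames (the frame counts used later in this study) takes 25,600/4,000 = 6.4 s, a 200-fold increase in acquisition time for a roughly 14-fold gain in SNR.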
Several recent developments, including advances in array transducers,
light delivery systems and deep-learning strategies have occurred in the
field of PA imaging [19]. Specifically in the realm of machine
learning strategies, many deep-learning network models are being
investigated to improve PA image reconstruction methodologies
[20-23]. Several important studies, including recent reviews
[24-27] in deep learning with laser-based systems, addressed the
problem of under-sampled data sparsity due to the limited number of
detectors. While many studies demonstrated their results on numerically
simulated data and in vitro phantoms [28-35], only a handful
of studies tested their deep-learning models on in vivo data. In
cases where in vivo data was used, both the training and testing
data sets were drawn from similar in-class sample data (for example,
training and testing were both done using similar phantoms or in
vivo vasculature datasets) [36-40]. This approach ultimately
limits the generalizability of the deep network and imposes a heavy
burden in terms of the number of training samples required [41, 42].
Notably, all the above studies concentrated on laser illumination to
alleviate the problems of limited-view, under-sampled data and bandwidth
correction, and did not utilize LED-based illumination.
Currently, the main roadblock preventing LED-based PAI systems from
making a clinical impact is their low signal-to-noise ratio due to low
fluence. In this study, we address this roadblock with a deep-learning
strategy: we implement a U-Net architecture with data augmentation that
is trained on a few datasets, can be employed on spatially invariant
out-of-class samples in real time, and is agnostic to various noise levels.
Specifically, we trained our U-Net architecture with input images
(averaging 128 frames, i.e., low frame average image) and labels
(averaging 25,600 frames, i.e., high frame average image) generated with
an LED-based photoacoustic system, Acoustic X.
We performed in vivo mouse subcutaneous tumor imaging with the
Acoustic X system and quantitatively compared the results on various
image metrics between our U-Net architecture and other traditional noise
reduction filters such as Median, Wiener, Total Variation (TV) and
Savitzky-Golay (SG) filters. Next, we performed a comparative
frequency-domain study of the magnitude spectra of the low
frame-averaged, high frame-averaged and U-Net images, and identified the
cause of the blurring effects in the U-Net outcomes. Last, we performed
a study to verify the invariance of the network to the noise
distribution type, using
Gaussian and Salt & Pepper (S&P) noise. Though U-Net is unable to
handle the S&P noise, it would not possibly create any major hurdle in
real life clinics, as most of the electronic noise follow Gaussian noise
distribution, and our U-Net is well adapted to the Gaussian noise. In
summary, the simple U-Net architecture can provide high SNR
photoacoustic images by compensating for the low fluence in LED-based
systems and this might yield a closer step towards the clinical
translation (or from bench to bedside) of PA imaging even though the
network is susceptible to brightness contrast illusions and produces a
blurry image comparative to the ground truth.
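For reference, synthetic noise of these two types can be generated along the following lines (a minimal Python sketch using skimage.util.random_noise; the variance and amount values are illustrative assumptions, not the levels used in this study):

```python
import numpy as np
from skimage.util import random_noise

def corrupt(img, kind="gaussian"):
    """Add synthetic noise to a float image scaled to [0, 1].

    kind: "gaussian" for zero-mean Gaussian noise, "s&p" for
    salt-and-pepper noise. The var/amount values below are
    illustrative placeholders, not the study's settings.
    """
    if kind == "gaussian":
        return random_noise(img, mode="gaussian", var=0.01)
    return random_noise(img, mode="s&p", amount=0.05)

# Example: corrupt a dummy 256 x 256 image with both noise types
img = np.random.rand(256, 256)
noisy_gaussian = corrupt(img, "gaussian")
noisy_sp = corrupt(img, "s&p")
```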
2 | EXPERIMENTAL METHODS
2.1 | Imaging System
The LED-based photoacoustic system Acoustic X (Cyberdyne Inc.) was
used for capturing all the images reported in this study. The array
transducer used for imaging had a central frequency of 7 MHz (128
elements, −6 dB bandwidth of 80%). The LEDs, mounted at an angular
position of around 43° relative to the transducer, produced 850 nm
wavelength irradiation with a 30 ns pulse width. A PRF of 4 kHz was utilized
in the study. Gain varied between 60 and 67 dB, depending on whether in
vitro or in vivo samples were being imaged, and was kept constant within
a given experiment. The dynamic range was set to 19 dB for the in vitro
phantoms and 19-25 dB for the mouse tumor models. A frame rate of 30 Hz
(128 image frames averaged per image) was considered the low frame rate,
and 0.15 Hz (25,600 frames averaged per image) the high frame rate.
After capturing the data from the LED system, the beamformed images were
further processed with custom-written MATLAB (R2021b) code for cropping
and resizing to 256 × 256 pixels, as sketched below. All the image
metrics were calculated in MATLAB.
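An equivalent preprocessing step in Python might look as follows (illustrative only; the crop bounds are hypothetical placeholders, not the coordinates used in this study):

```python
import numpy as np
from skimage.transform import resize

def preprocess(beamformed, row0=0, row1=512, col0=0, col1=512):
    """Crop a beamformed image to a region of interest and resize it
    to 256 x 256 pixels (the network input size). The default crop
    bounds are hypothetical placeholders."""
    cropped = beamformed[row0:row1, col0:col1]
    return resize(cropped, (256, 256), anti_aliasing=True)

# Example on a dummy beamformed frame
frame = np.random.rand(1024, 768)
img = preprocess(frame, row0=100, row1=612, col0=50, col1=562)
assert img.shape == (256, 256)
```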
2.2 | In vitro tissue mimicking phantoms
Several cylindrical graphite powder phantoms in a titanium dioxide
(TiO₂; Sigma Aldrich) scattering medium were prepared
inside a 6-well plate (Falcon™ 353046, Fisher Scientific).
TiO₂ (0.1%) was first mixed with hot (43-45 °C)
gelatine solution (8% w/v) and then poured into the wells after the
temperature reached around 35 °C, with a cylindrical tube positioned at
the centre to create a well for the graphite inclusion. A portion
of the TiO₂ gelatine solution was mixed with graphite
powder at concentrations of 0.01%, 0.025%, 0.05% and 0.1%,
and poured into the holes created in the phantoms.
2.3 | In vivo murine tumor models
Mice with subcutaneous xenograft tumors were used in this study to
evaluate our algorithm. Human pancreatic adenocarcinoma cells (AsPC-1)
were suspended in a mixture of Matrigel (BD Bioscience) and phosphate
buffered saline (1:1 v/v) at a density of 5 million cells per 100 μL and
injected subcutaneously in nude mice. At around 55-60 days post
inoculation, the tumors grew to approximately 300-400
mm³ in volume with a heterogeneous tumor microenvironment
containing both vascular and avascular regions. At the time of imaging,
the mice were anesthetized with 2% isoflurane and rested on a
custom-made platform in a water bath with the head raised above the
water level. During imaging, the isoflurane was reduced to 1-1.5% to
maintain general anaesthesia. A total of 8 mice were imaged in this
study. Three frames, ~4-5 mm apart, were acquired from each
mouse. All procedures were performed in compliance with the guidelines
of the Institutional Animal Care and Use Committee of Tufts University.
2.4 | Traditional denoising algorithms
Our network’s performance was compared to several established
non-machine-learning de-noising filters. We input our low
frame-averaged images into the Wiener [46], Median [47], TV
[48], and SG [49] filters. We used pixel-wise adaptive Wiener
filtering with a 5 × 5 neighborhood window for the local mean and
standard deviation estimation. We estimated the additive noise and then
passed that noise estimate to the Wiener filter using the MATLAB
“wiener2” command. The median filter was applied via the MATLAB
in-built “medfilt2” command. The primal-dual algorithm was utilized
for the TV-L1 denoising where the regularization parameter (λ) was fixed
at 5 and the number of iterations was set to 200. The parameter values
were set by trial and error; a lower value of λ denotes more aggressive
de-noising, which may produce an over-smoothed image. The SG filtering
was implemented in MATLAB with the in-built “sgolayfilt” command, where
the polynomial order was set to 3 (cubic) and the frame length to 15.
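The filtering above was done in MATLAB; a rough Python equivalent is sketched below (assumptions: SciPy's wiener, medfilt2d and savgol_filter, with scikit-image's Chambolle TV as a stand-in for the primal-dual TV-L1 solver, whose λ does not map one-to-one onto the Chambolle weight):

```python
import numpy as np
from scipy.signal import wiener, medfilt2d, savgol_filter
from skimage.restoration import denoise_tv_chambolle

def denoise_baselines(img):
    """Apply the four baseline filters to a 2D float image.

    Window/order values follow the text where given; the TV weight
    is an illustrative assumption (Chambolle TV-L2, not TV-L1).
    """
    out = {}
    # Pixel-wise adaptive Wiener filter, 5 x 5 neighborhood;
    # SciPy estimates the noise power from the local variances.
    out["wiener"] = wiener(img, mysize=5)
    # Median filter (default 3 x 3 kernel).
    out["median"] = medfilt2d(img.astype(np.float32))
    # TV denoising (Chambolle variant as a stand-in for TV-L1).
    out["tv"] = denoise_tv_chambolle(img, weight=0.1)
    # Savitzky-Golay: cubic polynomial, frame length 15, row-wise.
    out["sg"] = savgol_filter(img, window_length=15, polyorder=3, axis=-1)
    return out
```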
2.5 | Deep learning architecture
Following the work on biomedical image segmentation [50], we built a
U-Net deep learning network to predict high frame-averaged spatial
images from low frame-averaged ones. The U-Net code (main
model, training, data augmentation and testing) was written in Keras, a
deep learning API that runs on top of the end-to-end open-source
platform TensorFlow. The Google Colab platform with a Python 3 Google
Compute Engine backend GPU (Tesla K80, CUDA version 11.2, Intel(R)
Xeon(R) CPU @ 2.00 GHz) was used for this purpose. The fully
convolutional U-Net consists of an encoding path and a decoding path;
information lost during down-sampling of the image flows through the
skip connections (yellow arrows) to the corresponding up-sampled layers,
as shown in Fig. 1, and this concatenation of feature maps provides
localization information. The input image size was set to 256 × 256
pixels. Each layer
of the network has two multichannel feature maps (deep blue box in Fig.
1) with varying numbers of feature spaces (starting from 64). The
convolution operation was applied with 3 × 3 kernels (green
arrows) and ‘same’ padding so that the corners of the image are not
lost. The result was then passed through a non-linear (piecewise linear)
rectified linear unit (ReLU) activation function. After the
convolution and non-linearization steps, max-pooling was performed with
a 2 × 2 stride (orange down arrow), which halves the pixel dimensions;
the same convolution and pooling operations are then repeated three more
times. From the bottommost
layer, sometimes known as the bottleneck layer, up-sampling (black up
arrow) is executed with the same stride as the max-pooling, and the
convolution operations mirror those of the contracting path at the
corresponding level. The only additional operation is the concatenation,
denoted by the light blue box in Fig. 1. At the end of the contracting
path, there are two dropout
operations denoted by the brown arrows. The last convolution layer is
passed through the sigmoid activation function which provides
the predicted denoised outputs. Mean squared error (MSE) was
used as the loss function, and the Adam optimizer (a replacement for
classical stochastic gradient descent) was applied with an initial
learning rate of 1e-4.
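For concreteness, a condensed Keras sketch of this architecture is given below (our reconstruction from the description above and Fig. 1, not the exact training code; the dropout rate of 0.5 and the use of transposed convolutions for up-sampling are assumptions where the text is silent):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters):
    """Two 3x3 'same'-padded convolutions with ReLU, one U-Net level."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(256, 256, 1)):
    inputs = layers.Input(input_shape)
    # Contracting path: four 2x2 max-pooling steps, filters from 64 up
    skips, x = [], inputs
    for filters in (64, 128, 256, 512):
        x = conv_block(x, filters)
        if filters == 512:              # first dropout near the end of
            x = layers.Dropout(0.5)(x)  # the contracting path (rate assumed)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    # Bottleneck layer with the second dropout
    x = conv_block(x, 1024)
    x = layers.Dropout(0.5)(x)
    # Expanding path: up-sample, concatenate the skip, convolve
    for filters, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = conv_block(x, filters)
    # Final 1x1 convolution + sigmoid yields the predicted denoised image
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="mse")
    return model

model = build_unet()
model.summary()
```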