FIGURE 4 (a) LED-based photoacoustic images of subcutaneous
tumors in a mouse; several cross-sections were captured moving from
the tail toward the head. Low number of frame averages: images captured
with LED illumination by the Acoustic X system, with the LEDs operating
at a PRF of 4 kHz and 128 frames averaged. High number of frame averages:
PA image captured at a PRF of 4 kHz with 25,600-frame averaging at the
same cross-section as the low-frame-average image. U-Net outcomes: images
predicted by the U-Net deep learning technique, which improves the SNR of
the corresponding low-frame-averaged image. Photoacoustic images across
different subcutaneous tumor cross-sections for three mice. (b-d)
Ultrasound images of three different tumors, each demarcated with a
yellow dotted circle. The tumors are placed at different distances from
the transducer and LED array to achieve different light fluence
deposited on the tumor. (e-g) Images obtained with a low number of frame
averages. (h-j) Images predicted by the U-Net deep learning technique,
which improves the SNR of the corresponding low-frame-averaged images in
(e-g). (k-m) PA images of the same cross-sections as in (e-g) with a
high number of frame averages.
Illumination sources such as LED arrays are becoming popular for PA
imaging compared to lasers due to their light weight, small footprint,
low cost, longer lifetime, and easier portability. The low fluence
generated by the LEDs is a major concern; it is currently circumvented
by averaging a high number of frames, i.e., ~25,600 frames, which
corresponds to a frame rate of ~0.15 Hz. Reducing the number of frames
to be averaged increases the acquisition speed but significantly
decreases the SNR of the images, making noise-free real-time imaging
arduous; this trade-off is illustrated in the sketch below.
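As a minimal sketch of this trade-off, assuming the averaged frames are
acquired back-to-back at the stated 4 kHz PRF, the effective frame rate
is simply the PRF divided by the averaging count:

```python
# Effective frame rate = PRF / number of averaged frames
# (assumes back-to-back acquisition at the LED pulse repetition frequency)
prf_hz = 4000          # LED pulse repetition frequency (4 kHz)
print(prf_hz / 25600)  # high averaging: 0.15625 Hz (~6.4 s per frame)
print(prf_hz / 128)    # low averaging:  31.25 Hz (near real-time)
print(25600 / 128)     # frame "gain" ratio between the two modes: 200
```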
Here we developed a simple deep learning U-Net framework capable of
providing denoised images in real time. Many deep learning models have
already been devised for laser-based systems (which have higher energy
output than LEDs) to handle the under-sampling problem caused by
limited-view datasets [32, 41], but most of those studies tested their
models only on in vitro phantoms. Furthermore, no in vivo studies have
been reported in which the training samples were drawn from different
in-class distributions or from out-of-class data. Notably, our training
and testing sets are not drawn from a similar dataset, as was done in
other studies [54-56]; rather, metal plates and graphite powder-based
phantoms were used in the training phase, while the in vivo mouse tumor
images were used as the testing dataset.
Recently, a few studies have demonstrated the concept of compensating
for the low-fluence problem of LED-based systems with deep learning
[54-57]. In the studies by Hariri et al. [57] and Anas et al. [54],
the complexity of the network necessitated very large training datasets
to converge. The U-Net architecture presented in this study required a
very small training dataset (as few as 800 training samples), even
though we performed various data augmentations such as rotation, pixel
shifting, horizontal flipping, zooming, and shearing, as sketched below.
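A minimal sketch of such an augmentation pipeline, assuming a
Keras-style setup; the array shapes and parameter ranges below are
illustrative assumptions, not the study's exact values:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder arrays standing in for paired low/high frame-average images,
# shape (N, H, W, 1); in the study these would be the phantom training pairs.
x_low  = np.zeros((800, 128, 128, 1), dtype=np.float32)
y_high = np.zeros((800, 128, 128, 1), dtype=np.float32)

# Transformations named in the text; the parameter ranges are assumptions.
args = dict(rotation_range=15,       # rotation
            width_shift_range=0.1,   # pixel shift (width)
            height_shift_range=0.1,  # pixel shift (height)
            horizontal_flip=True,    # horizontal flip
            zoom_range=0.1,          # zooming
            shear_range=0.1)         # shear

# A shared seed applies identical geometric transforms to inputs and labels,
# keeping each low-average image aligned with its high-average target.
gen_in  = ImageDataGenerator(**args).flow(x_low,  batch_size=8, seed=1)
gen_out = ImageDataGenerator(**args).flow(y_high, batch_size=8, seed=1)
train_batches = zip(gen_in, gen_out)
```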
Capturing the training data is also easily achieved, as we used only one
LED system for both the training images and labels, without modifying
any setup parameters, because the low and high frame averaging is
performed by the Acoustic X software and its internal hardware. Compared
to Anas et al. [54], our U-Net frame gain ratio (25,600 / 128 = 200) is
on the higher side and the training data volume is comparatively low. In
other studies, by Manwar et al. [56] and Singh et al. [55], the testing
was done with only in-class samples, which might limit the
generalizability of the network when applied to different and unknown
in vivo samples. Our model, on the other hand, is simple, requires a
small amount of easily replicable training data, and was tested with
both in-class and out-of-class samples at different noise levels to
achieve a more generalized framework. When we compared the published
results of Vu et al. [38] with our model, we found that our model
provides significantly higher SNR and contrast-to-noise ratio (CNR).
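For reference, a minimal sketch of how SNR and CNR can be computed from
signal and background regions of interest; these are common conventions
and the exact formulas used in the study are an assumption here:

```python
import numpy as np

def snr_db(signal_roi, background_roi):
    """SNR in dB: mean signal relative to background noise (one common convention)."""
    return 20 * np.log10(signal_roi.mean() / background_roi.std())

def cnr(signal_roi, background_roi):
    """CNR: signal-background contrast normalized by background noise."""
    return (signal_roi.mean() - background_roi.mean()) / background_roi.std()
```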
Data augmentation in the form of various spatial deformations of the
training images, namely rotation, shear, zoom, flip, and height and
width shift, generated more diverse data with which the network was
trained. The resulting invariance to multiple geometric transformations
might make the U-Net robust and less data-hungry [50]. The contracting
path of the U-Net architecture is designed to extract higher-level
features at the cost of locality, while the up-sampling path integrates
those poorly localized high-level features back into the original image
resolution and regains the information lost in each max-pooling step
through the skip connections between corresponding hierarchical layers.
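A minimal Keras sketch of this encoder-decoder structure with skip
connections follows; the depth, filter counts, and input shape are
illustrative assumptions rather than the exact architecture used in the
study:

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions, as in the standard U-Net building block
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_unet(input_shape=(256, 256, 1)):
    inputs = layers.Input(input_shape)
    # Contracting path: higher-level features, locality sacrificed at each pooling
    c1 = conv_block(inputs, 32)
    p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, 64)
    p2 = layers.MaxPooling2D()(c2)
    b = conv_block(p2, 128)  # bottleneck
    # Expanding path: upsample and re-inject localized features via skip connections
    u2 = layers.concatenate([layers.UpSampling2D()(b), c2])
    c3 = conv_block(u2, 64)
    u1 = layers.concatenate([layers.UpSampling2D()(c3), c1])
    c4 = conv_block(u1, 32)
    outputs = layers.Conv2D(1, 1, activation="linear")(c4)  # denoised PA image
    return Model(inputs, outputs)
```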