FIGURE 4 (a) LED-based photoacoustic images of subcutaneous tumors in a mouse, captured at several cross-sections moving from the tail towards the head. Low number of frame averages: images captured with LED illumination by the Acoustic X system, with the LEDs operating at a PRF of 4 kHz and 128 frames averaged. High number of frame averages: PA images captured at the same cross-sections at a PRF of 4 kHz with 25,600-frame averaging. U-Net outcomes: images predicted by the U-Net deep learning framework, which improves the SNR of the corresponding low-frame-average images. (b-m): Photoacoustic images across different subcutaneous tumor cross-sections for three mice. (b-d): Ultrasound images of three different tumors, each demarcated with a yellow dotted circle. The tumors are located at different distances from the transducer and LED arrays to achieve different light fluences deposited on each tumor. (e-g): Images obtained with a low number of frame averages. (h-j): Images predicted by the U-Net deep learning framework, which improves the SNR of the corresponding low-frame-average images in (e-g). (k-m): PA images of the same cross-sections as in (e-g) with a high number of frame averages.
Illumination sources such as LED arrays are becoming popular for PA imaging compared to lasers owing to their light weight, small footprint, low cost, long lifetime, and easy portability. The low fluence generated by LEDs is a major concern, currently circumvented by acquiring and averaging a large number of frames, i.e., ~25,600 frames, which corresponds to a frame rate of ~0.15 Hz. Reducing the number of frames to be averaged increases the acquisition speed but significantly decreases the SNR of the images, making noise-free real-time imaging arduous. Here we developed a simple deep learning-based U-Net framework capable of providing a denoised image in real time. Many deep learning models have already been devised for laser-based systems (with higher energy output than LEDs) to handle the under-sampling problem arising from limited-view datasets [32, 41], but most of these studies tested their models only on in vitro phantoms. Furthermore, no in vivo studies have been reported where the training samples were drawn from different in-class distributions or from out-of-class training data. Notably, our training and testing sets are not drawn from the same dataset, as was done in other studies [54-56]: metal-plate and graphite-powder-based phantoms were used in the training phase, while in vivo mouse tumor images were used as the testing dataset.
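The averaging trade-off above can be illustrated with a back-of-the-envelope simulation (this is a hypothetical sketch, not the Acoustic X acquisition pipeline): averaging N frames of uncorrelated noise improves SNR by roughly the square root of N, so moving from 128 to 25,600 averages (200x more frames) buys about a 14x SNR gain while the effective frame rate at a 4 kHz PRF drops from ~31 Hz to ~0.156 Hz.

```python
import numpy as np

rng = np.random.default_rng(0)

def snr_after_averaging(n_frames, signal=1.0, noise_sigma=5.0, n_px=256):
    """Average n_frames simulated noisy A-lines and report the resulting SNR.

    Signal amplitude and noise level are illustrative assumptions.
    """
    frames = signal + noise_sigma * rng.standard_normal((n_frames, n_px))
    avg = frames.mean(axis=0)           # frame averaging suppresses noise
    return signal / avg.std()           # SNR grows roughly as sqrt(n_frames)

prf = 4000.0                            # LED pulse repetition frequency (Hz)
for n in (128, 25_600):
    print(f"{n:>6} averages: frame rate {prf / n:6.3f} Hz, "
          f"SNR ~ {snr_after_averaging(n):5.1f}")
```

The sqrt(200) ~ 14x SNR gain is exactly what the high-frame-average "label" images provide over the low-frame-average inputs, and what the U-Net is trained to recover without the 6.4 s acquisition penalty.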
Recently, a few studies have demonstrated the concept of compensating for the low-fluence problem of LED-based systems with deep learning [54-57]. In the studies by Hariri et al. [57] and Anas et al. [54], the complexity of the network necessitated a huge amount of training data for convergence. The U-Net architecture presented in this study requires a very small training dataset (as few as 800 training samples, even though we performed various data augmentations such as rotation, pixel shifting, horizontal flipping, zooming, and shearing). Capturing the training data is also easily achieved: we used a single LED system for both the training images and labels without modifying any setup parameters, because the low and high frame averaging is performed by the Acoustic X software and its internal hardware. Compared to Anas et al. [54], our U-Net frame gain rate (25,600 / 128 = 200) is on the higher side, and our training data volume is comparatively low. In other studies, by Manwar et al. [56] and Singh et al. [55], the testing was done only with in-class samples, which might limit the generalizability of the network when applied to different and unknown in vivo samples. Our model, on the other hand, is simple, requires a smaller amount of easily replicable training data, and was tested with both in-class and out-of-class samples at different noise levels to achieve a more generalized framework. When we compared the published results of Vu et al. [38] with our model, we found that our model provides significantly higher SNR and contrast-to-noise ratio (CNR). Data augmentation in the form of various spatial deformations of the training images, namely rotation, shear, zoom, flip, and height and width shifts, generated more diverse data with which the network was trained. The resulting invariance to multiple geometric transformations might make the U-Net robust and less data-hungry [50].
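A minimal sketch of the kind of geometric augmentation described above is shown below (an assumed numpy-only stand-in, not the actual training code; shear and zoom would normally be handled by an interpolating library such as scipy.ndimage or a framework augmentation pipeline):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img, rng):
    """Return a randomly transformed copy of a 2-D PA image.

    Covers flip, 90-degree rotation, and pixel shift; parameters
    (shift range, etc.) are illustrative assumptions.
    """
    if rng.random() < 0.5:                   # random horizontal flip
        img = np.fliplr(img)
    img = np.rot90(img, rng.integers(0, 4))  # random 90-degree rotation
    dy, dx = rng.integers(-4, 5, size=2)     # random pixel shift (wrap-around)
    return np.roll(img, shift=(dy, dx), axis=(0, 1))

phantom = rng.random((128, 128))             # stand-in for one training image
batch = np.stack([augment(phantom, rng) for _ in range(8)])
print(batch.shape)                           # (8, 128, 128)
```

Because each transform is a pure permutation of pixels, every augmented copy preserves the original intensity statistics while presenting the network with a geometrically distinct view, which is what allows a small training set to go further.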
The contracting path of the U-Net architecture is designed to extract higher-level features at the expense of locality, while the up-sampling path integrates those poorly localized high-level features back into the original image resolution. The information lost in each max-pooling step is regained through the skip connections between the corresponding hierarchical layers.
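This encoder-decoder symmetry can be traced schematically as follows (a shape-bookkeeping sketch; the depth, input size, and channel counts here are generic U-Net assumptions, not necessarily the exact configuration used in this work):

```python
def unet_shapes(size=128, base_ch=64, depth=4):
    """Trace (resolution, channels) through encoder, bottleneck, and decoder."""
    skips, enc = [], []
    ch = base_ch
    for _ in range(depth):                       # contracting path
        enc.append((size, ch))
        skips.append((size, ch))                 # saved for the skip connection
        size //= 2                               # 2x2 max-pool halves resolution
        ch *= 2                                  # conv block doubles channels
    bottleneck = (size, ch)
    dec = []
    for skip_size, skip_ch in reversed(skips):   # expanding path
        size *= 2                                # up-convolution doubles resolution
        ch //= 2
        assert size == skip_size                 # skip feature map must match
        dec.append((size, ch + skip_ch))         # concatenation restores locality
    return enc, bottleneck, dec

enc, bn, dec = unet_shapes()
print("encoder:   ", enc)    # [(128, 64), (64, 128), (32, 256), (16, 512)]
print("bottleneck:", bn)     # (8, 1024)
print("decoder:   ", dec)    # [(16, 1024), (32, 512), (64, 256), (128, 128)]
```

The assertion inside the decoder loop is the crux: each up-sampled feature map must match the resolution of its encoder counterpart so that the two can be concatenated, which is how the fine spatial detail discarded by max-pooling is reinjected.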