Figure 3: Sequence-HAN. The network builds up a series representation as a weighted combination of sequence representations, where each sequence representation is constructed from a weighted combination of slice representations.
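The two levels of weighting in Figure 3 amount to two stacked attention-pooling steps: slice representations are pooled into a sequence representation, and sequence representations are pooled into a series representation. The sketch below is a minimal, illustrative PyTorch version of this pooling, using the standard context-vector/weight-matrix/bias attention parameterization referred to in the Experiments section; the module names, the pooling module shared across sequences, and the omission of the CNN and LSTM slice encoders are our simplifications, not the authors' implementation.

```python
# Minimal sketch of two-level (slice -> sequence -> series) attention pooling.
# Names, dimensions, and weight sharing are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Collapse a set of representations into one weighted combination."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                        # attention matrix W and bias b
        self.context = nn.Parameter(torch.randn(dim) * 0.05)   # attention context vector

    def forward(self, h):                          # h: (batch, n_items, dim)
        u = torch.tanh(self.proj(h))               # (batch, n_items, dim)
        scores = u @ self.context                  # (batch, n_items)
        alpha = F.softmax(scores, dim=1)           # attention weights
        pooled = (alpha.unsqueeze(-1) * h).sum(dim=1)  # weighted combination
        return pooled, alpha

class SequenceHANHead(nn.Module):
    """Slice representations -> sequence representations -> series representation."""
    def __init__(self, dim, n_classes=2):
        super().__init__()
        self.slice_pool = AttentionPool(dim)       # within each sequence (shared here for brevity)
        self.sequence_pool = AttentionPool(dim)    # across sequences
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, slice_feats):                # list of (batch, n_slices, dim), one per sequence
        seq_reprs, slice_weights = zip(*[self.slice_pool(h) for h in slice_feats])
        seq_stack = torch.stack(seq_reprs, dim=1)  # (batch, n_sequences, dim)
        series_repr, seq_weights = self.sequence_pool(seq_stack)
        return self.classifier(series_repr), seq_weights, slice_weights
```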
Experiments
In this work we present two proof-of-principle experiments to demonstrate the general utility of treating clinical neuroimaging data hierarchically: one using sequence-HAN to determine the most informative sequence(s) for this dataset, and a second using patch-HAN to perform more granular abnormality detection using this sequence. For the first experiment we included axial T2-weighted images of size (512 × 512) × 23 slices, axial diffusion-weighted images (DWI) of size (256 × 256) × 7 slices, and axial localizer images of size (256 × 256) × 7 slices, as these images were common to nearly all clinical examinations. We split this dataset into training/validation/test sets of sizes 800, 200, and 200, respectively, where each instance contains a stack of slices for each of the three sequences. Because this is real-world hospital data, multiple series per patient are common and images from these separate visits are likely to be highly correlated. To avoid this form of data leakage we performed the split at the level of patients, so that no patient appearing in the training set appeared in the validation or test set. The images were then minimally pre-processed, with each slice normalized to zero mean and unit variance. No skull-stripping or co-registration was performed, reducing the complexity of the pipeline and the computational burden.

For all experiments, the attention context vectors, matrices, and biases were initialized from a zero-centred normal distribution with variance σ = 0.05. In all experiments the CNNs were 18-layer ResNet networks, warm-started using weights pre-trained on ImageNet (Deng et al., 2009). The networks were trained for 15 epochs (with early stopping) using Adam (Kingma and Ba, 2014) with an initial learning rate of 1e-4, decayed by 0.97 after each epoch, on a single NVIDIA RTX 2080 Ti 11 GB GPU. Minimal hyperparameter tuning (in this case, the learning rate and LSTM dimensionality) was performed on the validation set, and the model with the best classification accuracy was used to determine the final model performance on the balanced, independent test set. For the results that we present, all LSTM hidden units had a dimension of 512, and the patch size for patch-HAN was 150 × 150.

To benchmark patch-HAN, we trained a baseline network which put equal weighting on each slice and patch, and a second which processed each slice and patch independently and performed sum pooling (i.e., a fully convolutional model with no recurrent network). To benchmark sequence-HAN, we trained a convolutional model which performed sum pooling over slices and sequences, as well as a recurrent network which put equal weighting on each sequence and slice.
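As a concrete illustration of the data handling described above, the following sketch shows one way to perform a patient-level split (so that no patient crosses partitions) and the per-slice zero-mean, unit-variance normalization; GroupShuffleSplit, the function names, and the two-way split (applied twice to obtain train/validation/test) are our assumptions rather than the authors' pipeline.

```python
# Illustrative patient-level split and per-slice normalization.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(study_ids, patient_ids, test_size=0.2, seed=0):
    """Split studies so that no patient appears in more than one partition."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, held_out_idx = next(splitter.split(study_ids, groups=patient_ids))
    return train_idx, held_out_idx   # apply again to held_out_idx for validation/test

def normalize_slice(slice_pixels, eps=1e-8):
    """Zero-mean, unit-variance normalization computed per slice."""
    slice_pixels = np.asarray(slice_pixels, dtype=np.float32)
    return (slice_pixels - slice_pixels.mean()) / (slice_pixels.std() + eps)
```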
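The optimization settings reported above can likewise be written as a short configuration sketch: an 18-layer ResNet warm-started from ImageNet weights, Adam with an initial learning rate of 1e-4, and an exponential decay of 0.97 applied once per epoch. Only the backbone parameters are shown for brevity; in practice the full HAN parameters would be passed to the optimizer.

```python
# Illustrative training configuration matching the reported hyperparameters.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Warm-start an 18-layer ResNet from ImageNet weights (Deng et al., 2009) and
# replace its classification head so it emits 512-d slice features.
backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()

# Adam with initial learning rate 1e-4, decayed by 0.97 after each epoch.
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.97)
# Call scheduler.step() once at the end of each of the 15 training epochs
# (with early stopping on the validation set, as in the experiments above).
```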
Results
The classification performance of sequence-HAN, along with that of the two multi-sequence baseline architectures, appears in Table 1. Our model outperforms both simpler multi-sequence networks, achieving a classification accuracy of 95.5%, illustrating the value of treating different sequences and slices hierarchically. By analysing the attention weights of our model, the most informative sequence and/or slice can be determined for a particular study. Figure 4 shows the sequence weights for two examples from the test set, as well as the weights of each slice for the most informative sequence (in both cases T2-weighted). These examples were representative; in general, the model put most weight on the T2-weighted sequence for classification, with average scores of 0.81, 0.13, and 0.06 across the whole test set for the T2-weighted, DWI, and localizer sequences, respectively. The slice attention weights also showed good agreement with the spatial distribution of the abnormalities.
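The averaged sequence weights quoted above can be obtained by accumulating the sequence-level attention weights over the test set, as in the sketch below; `model` and `test_loader` are assumed placeholders, with the model returning the sequence attention weights alongside its prediction (as in the pooling sketch after Figure 3).

```python
# Illustrative aggregation of sequence-level attention weights over the test set.
import torch

def mean_sequence_weights(model, test_loader, n_sequences=3):
    """Average the sequence attention weights across all test studies."""
    totals = torch.zeros(n_sequences)
    count = 0
    model.eval()
    with torch.no_grad():
        for slice_feats, _ in test_loader:
            _, seq_weights, _ = model(slice_feats)   # (batch, n_sequences)
            totals += seq_weights.sum(dim=0)
            count += seq_weights.shape[0]
    return totals / count   # e.g. ordered (T2-weighted, DWI, localizer)
```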