Figure 3: Sequence-HAN. The network builds up a series
representation as a weighted combination of sequence representations,
where each sequence representation is constructed from a weighted
combination of slice representations.
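As a concrete illustration of this two-level weighting, the following minimal PyTorch sketch applies HAN-style attention pooling first over slice representations and then over sequence representations. The module and function names (AttentionPool, series_representation) are ours, the CNN and LSTM encoders that produce the slice features are omitted, and the sketch is not the authors' released code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionPool(nn.Module):
        """HAN-style attention pooling over a set of feature vectors."""
        def __init__(self, dim):
            super().__init__()
            self.proj = nn.Linear(dim, dim)                # attention matrix and bias
            self.context = nn.Parameter(torch.empty(dim))  # attention context vector
            # Zero-centred normal initialisation (sigma = 0.05; see Experiments).
            for p in (self.proj.weight, self.proj.bias, self.context):
                nn.init.normal_(p, mean=0.0, std=0.05)

        def forward(self, feats):                          # feats: (n_items, dim)
            u = torch.tanh(self.proj(feats))
            weights = F.softmax(u @ self.context, dim=0)   # one weight per item
            pooled = (weights.unsqueeze(-1) * feats).sum(dim=0)
            return pooled, weights

    slice_pool = AttentionPool(dim=512)     # slice representations -> sequence representation
    sequence_pool = AttentionPool(dim=512)  # sequence representations -> series representation

    def series_representation(slice_feats_per_sequence):
        """slice_feats_per_sequence: list of (n_slices, 512) tensors, one per MRI sequence."""
        seq_reps, slice_weights = [], []
        for feats in slice_feats_per_sequence:
            rep, w = slice_pool(feats)
            seq_reps.append(rep)
            slice_weights.append(w)
        series_rep, seq_weights = sequence_pool(torch.stack(seq_reps))
        return series_rep, seq_weights, slice_weights

The returned series representation can then be passed to a classification head, while the two sets of weights expose which sequences and slices drove the prediction.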
Experiments
In this work we present two proof-of-principle experiments to demonstrate the general utility of treating clinical neuroimaging data hierarchically: one using sequence-HAN to determine the most informative sequence(s) for this dataset, and a second using patch-HAN to perform more granular abnormality detection using this sequence.
For the first
experiment we included axial T2-weighted images of size
(512 x 512) x 23 slices, axial
diffusion-weighted images (DWI) of size (256 x 256) x 7 slices, and
axial localizer images of
size (256 x 256) x 7 slices, as these images were common to nearly all
clinical examinations.
We split this dataset into training/validation/test sets of sizes 800, 200, and 200, respectively, where each instance contains a stack of slices for each of the three sequences. Because this is real-world hospital data, multiple series per patient are common, and images from these separate visits are likely to be highly correlated. To avoid this form of data leakage we performed the split at the level of patients, so that no patient who appeared in the training set appeared in the validation or test set.
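A patient-level split of this kind can be obtained, for example, with scikit-learn's GroupShuffleSplit, grouping studies by a patient identifier; the variable names (studies, labels, patient_ids) are hypothetical, and the exact split sizes become approximate because whole patients are assigned to one side of each split.

    from sklearn.model_selection import GroupShuffleSplit

    # studies: list of multi-sequence image stacks; labels: per-study labels;
    # patient_ids: one patient identifier per study (hypothetical names).
    splitter = GroupShuffleSplit(n_splits=1, test_size=400, random_state=0)
    train_idx, holdout_idx = next(splitter.split(studies, labels, groups=patient_ids))

    # Split the held-out studies into validation and test, again at the patient level.
    holdout_patients = [patient_ids[i] for i in holdout_idx]
    val_test_splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
    val_rel, test_rel = next(val_test_splitter.split(holdout_idx, groups=holdout_patients))
    val_idx = [holdout_idx[i] for i in val_rel]
    test_idx = [holdout_idx[i] for i in test_rel]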
The images were then minimally pre-processed, with the pixels of each slice normalized to zero mean and unit variance. No skull-stripping or co-registration was performed, reducing the complexity of the process and the computational burden.
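The per-slice normalization amounts to standard z-scoring; a minimal NumPy version (array names are ours) is:

    import numpy as np

    def normalise_slice(slice_px: np.ndarray, eps: float = 1e-8) -> np.ndarray:
        """Normalise one slice to zero mean and unit variance."""
        return (slice_px - slice_px.mean()) / (slice_px.std() + eps)

    volume = np.random.rand(23, 512, 512)  # stand-in for one T2-weighted stack
    normalised = np.stack([normalise_slice(s) for s in volume])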
For all experiments, the attention context vectors, matrices, and biases were initialized from a zero-centred normal distribution with variance σ = 0.05.
In all experiments the CNNs were 18-layer ResNet networks, warm-started from weights pre-trained on ImageNet (Deng et al., 2009). The networks were trained for 15 epochs (with early stopping) using ADAM (Kingma and Ba, 2014) with an initial learning rate of 1e-4, decayed by a factor of 0.97 after each epoch, on a single NVIDIA RTX 2080 Ti 11 GB GPU.
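Under these settings, the optimiser and learning-rate schedule correspond to standard PyTorch components; the sketch below assumes a torchvision ResNet-18 backbone and a binary normal/abnormal output, and is not the authors' exact training loop.

    import torch
    from torchvision import models

    # ImageNet-pretrained ResNet-18 backbone (warm start), with the final layer
    # replaced for an assumed binary normal/abnormal classification task.
    backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    backbone.fc = torch.nn.Linear(backbone.fc.in_features, 2)

    optimiser = torch.optim.Adam(backbone.parameters(), lr=1e-4)
    # Decay the learning rate by a factor of 0.97 after each epoch.
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimiser, gamma=0.97)

    for epoch in range(15):
        # ... one pass over the training set, with early stopping on the validation set ...
        scheduler.step()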
Minimal hyperparameter tuning (in this case of the learning rate and LSTM dimensionality) was performed on the validation set, and the model with the best classification accuracy was used to determine the final model performance on the balanced independent test set. For the results that we present, all LSTM hidden units had a dimension of 512, and the patch size for patch-HAN was 150 x 150. To benchmark patch-HAN, we trained a baseline network which put equal weighting on each slice and patch, and a second which processed each slice and patch independently and performed sum pooling (i.e., a fully convolutional model with no recurrent network). For the inter-sequence HAN we trained a convolutional model which performed sum pooling over slices and sequences, as well as a recurrent network which put equal weighting on each sequence and slice.
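For patch-HAN, each slice must first be tiled into 150 x 150 patches before the hierarchical pooling is applied. One simple way to do this in PyTorch is sketched below; the non-overlapping stride and the discarding of the leftover border are our assumptions, not choices stated in the text.

    import torch

    def extract_patches(slice_px: torch.Tensor, size: int = 150) -> torch.Tensor:
        """Tile a (H, W) slice into non-overlapping (size, size) patches."""
        patches = slice_px.unfold(0, size, size).unfold(1, size, size)
        return patches.reshape(-1, size, size)  # (n_patches, size, size)

    slice_px = torch.randn(512, 512)     # e.g. one axial T2-weighted slice
    patches = extract_patches(slice_px)  # 3 x 3 = 9 patches of 150 x 150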
Results
The classification performance of sequence-HAN, along with that of the
two multi-sequence
baseline architectures, appears in Table 1. Our model outperforms all
simpler multi-
sequence networks, achieving a classification accuracy of 95.5%,
illustrating the value of
treating different sequences and slices hierarchically. By analysing the
attention weights of
our model, the most informative sequence and/or slice can be determined
for a particular
study. Figure 4 shows the sequence weights for two examples from the
test set, as well as the weights of each slice for the most informative
sequence (in both cases T2-weighted). These examples
were representative; in general, the model put most weight on the
T2-weighted sequence for classification, with average
scores of 0.81, 0.13, and 0.06 across the whole test set for
T2-weighted, DWI, and localizer sequences respectively.
The slice attention weights showed good agreement with the spatial distribution of the abnormalities.
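The per-sequence averages reported above can be computed directly from the attention weights returned by the model; a minimal sketch, reusing the hypothetical series_representation function from earlier and an assumed test_set iterable, is:

    import torch

    # test_set: iterable over test studies, each a list of per-sequence slice-feature
    # tensors ordered [T2-weighted, DWI, localizer] (hypothetical data structure).
    sequence_names = ["T2-weighted", "DWI", "localizer"]
    totals = torch.zeros(len(sequence_names))
    with torch.no_grad():
        for slice_feats_per_sequence in test_set:
            _, seq_weights, _ = series_representation(slice_feats_per_sequence)
            totals += seq_weights
    mean_weights = totals / len(test_set)  # reported averages: 0.81, 0.13, 0.06
    print(dict(zip(sequence_names, mean_weights.tolist())))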