Abstract
Clinical neuroimaging data is naturally hierarchical. Different magnetic
resonance imaging (MRI) sequences within a series, different slices
covering the head, and different regions within each slice all confer
different information. In this work we present a hierarchical attention
network for abnormality detection using MRI scans obtained in a clinical
hospital setting. The proposed network is suitable for non-volumetric
data (i.e., stacks of high-resolution MRI slices), and can be trained
from binary examination-level labels. We show that this hierarchical
approach leads to improved classification while providing
interpretability, either through coarse inter- and intra-slice
abnormality localisation or through importance scores for different
slices and sequences, making our model suitable for use as an automated
triaging system in radiology departments.
Introduction
Deep learning-based computer vision systems hold promise for
automatically triaging patients in hospital radiology departments. In
the UK, for example, with a 4.6% increase in brain magnetic resonance
imaging (MRI) scans performed in the last 12 months alone (NHS, 2019),
and with an increase in the time taken to report out-patient MRI scans
every year since 2012, an automated triage mechanism to identify
abnormalities at the time of imaging, and thereby allow prioritised scan
reporting, is urgently needed. Such a mechanism would potentially allow
early intervention to improve short- and long-term clinical outcomes.
Assuming that a first-generation system will operate by assisting
real-time radiologist review, any prospective model must provide a
quickly visualizable justification for its decision. Interpretability
would also be essential to engender radiologist confidence and support
clinical trials of second-generation autonomous systems (Booth et al.,
2020). Ideally, this visualization would take the form of abnormal
tissue segmentation, with the model outputting pixel-level probabilities
in addition to accurate scan classification (i.e., normal vs. abnormal).
However, training such a model by supervised learning requires large
numbers of manually segmented images which are often not readily
available. One approach to circumvent this bottleneck is to directly
apply in clinical settings those models trained on curated open-access
data collections that do have segmentation labels, such as the Brain
Tumour Segmentation Challenge (BRATS) (Menze et al., 2015), or Ischemic
Stroke Lesion Segmentation Challenge (ISLES) (Winzeck et al., 2018)
datasets. However, these off-the-shelf models, being trained on
standardized and often heavily pre-processed (i.e., skull stripped,
spatially co-registered, isotropic) volumetric images, often suffer from
domain shift; in other words, they fail to generalise to less
homogeneous datasets such as the wide range of MRI scans generated at
hospitals.
An alternative approach is to develop a model trained on these less
homogeneous hospital datasets using simple classification information
(i.e., normal vs. abnormal scan) in order to coarsely localise, rather
than segment, an abnormality (Wood et al., 2020), (Wood et al., 2021).
Localisation of this kind, although not suitable for precision
applications such as computer-guided surgery and planning, is ideal for
triage systems where the priority is to quickly identify and present the
location of an abnormality for radiologist review (Din et al., 2023),
(Agarwal et al., 2023), (Wood et al., 2022).
In this work we present a hierarchical attention model for automated
abnormality detection from weak supervision labels. We characterise the
labels as weak because they are provided at the series level. An MRI
series is the entire set of MRI scans, incorporating multiple sequences
(such as
T1-weighted, T2-weighted,
diffusion-weighted sequences), obtained during a patient’s scanning
session. Built around nested long short-term memory (LSTM) units and
convolutional neural networks (CNNs), the proposed network is suitable
for non-volumetric data (i.e., stacks of high-resolution MRI slices),
and can be trained on minimally processed images extracted from hospital
picture archiving and communication systems (PACS) and labelled using a
recently
developed radiological report language model (ALARM) (Wood et al.,
2020). We show that this hierarchical approach leads to improved
classification, while coarsely localising the inter- and intra-slice
abnormality. The proposed approach is general, and would allow
integration of other information relating to a study, e.g., data from
different imaging modalities (computed tomography (CT) or positron
emission tomography (PET)) or even non-imaging data such as patient
clinical history, all of which are highly desirable in a clinical
setting (Booth et al., 2020). We have demonstrated this integration by
incorporating multiple MRI sequences and have shown that such a strategy
outperforms sum-pooling models and recurrent networks without attention,
while providing importance scores for each sequence and slice.
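As a rough illustration of the idea, the two-level pooling can be
sketched as follows. This is a simplification, not the trained
architecture: the function names, the dot-product scoring, and the use
of plain NumPy arrays in place of CNN/LSTM features are all our own
assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(feats, query):
    """Attention-weighted pooling of a stack of feature vectors.

    feats: (n, d) array of embeddings (e.g. per-slice CNN features);
    query: (d,) learned attention vector. Returns the pooled (d,)
    summary and the (n,) importance scores."""
    scores = softmax(feats @ query)
    return scores @ feats, scores

def series_representation(sequences, w_slice, w_seq):
    """Hierarchical pooling: slices -> sequence summaries -> series.

    sequences: list of (n_slices_i, d) arrays, one per MRI sequence;
    stacks may have different numbers of slices. Returns the
    series-level vector plus slice- and sequence-level importances."""
    summaries, slice_scores = [], []
    for feats in sequences:
        summary, scores = attend(feats, w_slice)
        summaries.append(summary)
        slice_scores.append(scores)
    series, seq_scores = attend(np.stack(summaries), w_seq)
    return series, slice_scores, seq_scores
```

The per-slice and per-sequence weights returned here play the role of
the importance scores described above; in the full model the slice
features are produced by a CNN and the pooling operates over recurrent
hidden states.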
Related work
Weakly supervised abnormality detection has attracted considerable
interest in recent years.
To date, most approaches have been based around class activation mapping
(CAM) (Zhou
et al., 2015), whereby candidate regions of interest generated using
fully convolutional networks are processed to generate pixel-level segmentation maps (Feng et
al., 2017), (Wei
et al., 2017), (Izadyyazdanabadi et al., 2018), (Wu et al., 2019). One
limitation of this
approach is the requirement of slice- rather than series-level labels,
meaning that all slices
from each sequence used in a training set need to be manually labelled
for the presence or
absence of an abnormality or lesion. This makes the construction of
large, labelled datasets
considerably more time-consuming and expensive. A further shortcoming is
the implicit
treatment of slices as being independent of each other, thereby failing
to leverage inter-slice
spatial dependencies for abnormality detection.
Our work builds on that of (Poudel et al., 2016) and (Cai et al., 2018),
treating variable-length stacks of MRI slices as correlated
information and processing these data using recurrent convolutional
networks. Crucially, we relax
the requirement for pixel-level labels by
incorporating a hierarchical attention mechanism, first introduced for
language modelling (Yang et al., 2016), to exploit the natural
hierarchies present in
neuroimaging data. In
this way, our model is similar to (Zhang et al., 2017), (Yan et al.,
2019), (Cole et al., 2020),
and (Wood et al., 2019), who used visual attention to provide a form of
model interpretability for medical image analysis. To our knowledge,
however, this is the first demonstration
of using hierarchical attention for weakly-supervised neurological
abnormality detection.
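For concreteness, the attention operation of (Yang et al., 2016) scores
each hidden state against a learned context vector (notation simplified
here; in the hierarchical setting it is applied once over the slices
within a sequence and again over the resulting sequence summaries):

```latex
u_t = \tanh(W h_t + b), \qquad
\alpha_t = \frac{\exp(u_t^{\top} c)}{\sum_{t'} \exp(u_{t'}^{\top} c)}, \qquad
s = \sum_t \alpha_t h_t
```

where $W$, $b$, and $c$ are learned parameters, the weights $\alpha_t$
serve as importance scores, and $s$ is the attention-pooled summary
passed to the next level of the hierarchy.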
Methods
Data
The UK National Health Research Authority and Research Ethics Committee
approved
this study. All 126,556 adult (≥ 18 years old) MRI head scans performed
at KCH hospital
between 2008 and 2019 were used in this study. MRI scans were obtained
on either a 1.5 T Signa HDX scanner (General Electric Healthcare) or a
1.5 T Aera scanner (Siemens, Erlangen, Germany). Using the ALARM
radiology report classifier described in (Wood
et al., 2020) and (Wood
et al., 2022), all examinations were assigned a binary label,
corresponding to the presence or absence of an abnormality predicted on
the basis of the accompanying free text
neuroradiology report describing the study. The classification accuracy
of this model is
99.4%, so this labelling procedure is considered reliable. A subset of
600 abnormal examinations was then selected for inclusion using an
open-source annotation tool (available at
https://github.com/tomvars/sifter, see (Wood et al., 2020)).
Because the hospital
dataset consisted of MRIs obtained at different stages of the patient
pathway (including
initial diagnostic imaging, pre-surgical planning, immediate
post-surgical assessment, and
chemoradiotherapy response assessment over a longer period of
follow-up), the abnormalities incorporated were heterogeneous, including
tumours at diagnosis, resection cavities after surgery, and
post-treatment related effects at follow-up (Fig. 1). As such, this
dataset provided our model with training abnormalities that varied in
size and MRI signal characteristics. 600 series were then randomly
selected from a large subset labelled as normal to make a combined
balanced dataset of 1200 examinations.
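The balanced-dataset construction described above can be sketched as
follows. This is a minimal illustration only; the identifier lists,
random seed, and label encoding (1 = abnormal, 0 = normal) are our
assumptions rather than details of the original pipeline.

```python
import random

def build_balanced_dataset(abnormal_ids, normal_ids, n_per_class=600, seed=0):
    """Pair the curated abnormal subset with an equal-sized random
    sample of normal series, returning shuffled (series_id, label)
    pairs with 1 = abnormal and 0 = normal."""
    rng = random.Random(seed)
    abnormal = rng.sample(list(abnormal_ids), n_per_class)
    normal = rng.sample(list(normal_ids), n_per_class)
    dataset = [(i, 1) for i in abnormal] + [(i, 0) for i in normal]
    rng.shuffle(dataset)
    return dataset
```

Fixing the seed makes the normal-cohort sampling reproducible, which
matters when the same 1200-examination split must be reused across
experiments.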