Abstract
Clinical neuroimaging data is naturally hierarchical. Different magnetic resonance imaging (MRI) sequences within a series, different slices covering the head, and different regions within each slice each convey different information. In this work we present a hierarchical attention network for abnormality detection using MRI scans obtained in a clinical hospital setting. The proposed network is suitable for non-volumetric data (i.e., stacks of high-resolution MRI slices) and can be trained from binary examination-level labels. We show that this hierarchical approach leads to improved classification while providing interpretability, both through coarse inter- and intra-slice abnormality localisation and through importance scores for different slices and sequences, making our model suitable for use as an automated triaging system in radiology departments.
Introduction
Deep learning-based computer vision systems hold promise for automatically triaging patients in hospital radiology departments. In the UK, for example, brain magnetic resonance imaging (MRI) scan volumes increased by 4.6% in the last 12 months alone (NHS, 2019), and the time taken to report out-patient MRI scans has increased every year since 2012; an automated triage mechanism that identifies abnormalities at the time of imaging, thereby allowing prioritised scan reporting, is therefore urgently needed. Such a mechanism would potentially allow early intervention to improve short- and long-term clinical outcomes. Assuming that a first-generation system will operate by assisting real-time radiologist review, any prospective model must provide a quickly visualisable justification for its decisions. Interpretability would also be essential to engender radiologist confidence and to support clinical trials of second-generation autonomous systems (Booth et al., 2020).

Ideally, this visualisation would take the form of abnormal tissue segmentation, with the model outputting pixel-level probabilities in addition to accurate scan classification (i.e., normal vs. abnormal). However, training such a model by supervised learning requires large numbers of manually segmented images, which are often not readily available. One approach to circumvent this bottleneck is to apply directly in clinical settings models trained on curated open-access data collections that do have segmentation labels, such as the Brain Tumour Segmentation Challenge (BRATS) (Menze et al., 2015) or the Ischemic Stroke Lesion Segmentation Challenge (ISLES) (Winzeck et al., 2018) datasets. However, these off-the-shelf models, being trained on standardised and often heavily pre-processed (i.e., skull-stripped, spatially co-registered, isotropic) volumetric images, often suffer from domain shift; in other words, they fail to generalise to less homogeneous datasets such as the wide range of MRI scans generated at hospitals.
An alternative approach is to develop a model trained on these less homogeneous hospital datasets using simple classification labels (i.e., normal vs. abnormal scan) in order to coarsely localise, rather than segment, an abnormality (Wood et al., 2020), (Wood et al., 2021). Localisation of this kind, although not suitable for precision applications such as computer-guided surgery and planning, is ideal for triage systems where the priority is to quickly identify and present the location of an abnormality for radiologist review (Din et al., 2023), (Agarwal et al., 2023), (Wood et al., 2022).
In this work we present a hierarchical attention model for automated abnormality detection from weak supervision labels, where "weak" means that labels are provided at the series level. An MRI series is the entire set of MRI scans, incorporating multiple sequences (such as T1-weighted, T2-weighted, and diffusion-weighted sequences), obtained during a patient's scanning session. Built around nested long short-term memory (LSTM) units and convolutional neural networks (CNNs), the proposed network is suitable for non-volumetric data (i.e., stacks of high-resolution MRI slices), and can be trained on minimally processed images extracted from hospital picture archiving and communication systems (PACS) and labelled using a recently developed radiological report language model (ALARM) (Wood et al., 2020). We show that this hierarchical approach leads to improved classification, while coarsely localising abnormalities at both the inter- and intra-slice level. The proposed approach is general, and would allow the integration of other information relating to a study, such as data from different imaging modalities (e.g., computed tomography (CT) or positron emission tomography (PET)) or even non-imaging data such as patient clinical history, all of which are highly desirable in a clinical setting (Booth et al., 2020). We demonstrate this integration by incorporating multiple MRI sequences, and show that this strategy outperforms sum-pooling models and recurrent networks without attention, while providing importance scores for each sequence and slice.
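To make the architecture concrete, the following is a minimal PyTorch sketch of the hierarchy described above (a CNN slice encoder, an LSTM with attention over slices within a sequence, and an LSTM with attention over sequences within a series). All layer sizes, and the assumption of a fixed number of slices per sequence, are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the hierarchical attention architecture: CNN slice encoder
# -> LSTM + attention over slices -> LSTM + attention over sequences -> binary
# classifier. Layer sizes are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Additive attention (Yang et al., 2016): score each timestep, return the
    weighted sum plus the weights themselves for interpretability."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.context = nn.Linear(dim, 1, bias=False)

    def forward(self, h):                                  # h: (batch, steps, dim)
        scores = self.context(torch.tanh(self.proj(h)))    # (batch, steps, 1)
        weights = torch.softmax(scores, dim=1)
        return (weights * h).sum(dim=1), weights.squeeze(-1)

class HierarchicalAttentionNet(nn.Module):
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        # Small CNN slice encoder; a pretrained backbone could be swapped in.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.slice_lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                                  bidirectional=True)
        self.slice_attn = AttentionPool(2 * hidden)
        self.seq_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True,
                                bidirectional=True)
        self.seq_attn = AttentionPool(2 * hidden)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, series):
        # series: (batch, n_sequences, n_slices, H, W), single-channel slices;
        # fixed n_slices assumed here for brevity.
        b, s, n, h, w = series.shape
        feats = self.encoder(series.reshape(b * s * n, 1, h, w))
        slice_h, _ = self.slice_lstm(feats.reshape(b * s, n, -1))
        seq_emb, slice_w = self.slice_attn(slice_h)        # per-slice weights
        seq_h, _ = self.seq_lstm(seq_emb.reshape(b, s, -1))
        series_emb, seq_w = self.seq_attn(seq_h)           # per-sequence weights
        return self.classifier(series_emb), slice_w.reshape(b, s, n), seq_w
```

The returned slice and sequence attention weights are what provide the importance scores and coarse inter-slice localisation referred to above; in practice, variable-length slice stacks would be handled with padding and masking.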
Related work
Weakly supervised abnormality detection has attracted considerable interest in recent years. To date, most approaches have been based on class activation mapping (CAM) (Zhou et al., 2015), whereby candidate regions of interest generated using fully convolutional networks are processed to generate pixel-level segmentation maps (Feng et al., 2017), (Wei et al., 2017), (Izadyyazdanabadi et al., 2018), (Wu et al., 2019). One limitation of this approach is the requirement for slice- rather than series-level labels, meaning that every slice from each sequence in a training set must be manually labelled for the presence or absence of an abnormality or lesion. This makes the construction of large labelled datasets considerably more time-consuming and expensive. A further shortcoming is the implicit treatment of slices as independent of each other, thereby failing to leverage inter-slice spatial dependencies for abnormality detection.
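For reference, CAM reduces to a weighted sum of the final convolutional feature maps, using the classifier weights of the class of interest; the sketch below illustrates the idea (the function name and tensor shapes are our own assumptions, not taken from any cited implementation).

```python
# Illustrative class activation mapping (CAM) (Zhou et al., 2015): project the
# final conv feature maps onto the classifier weights of the chosen class to
# obtain a coarse localisation heatmap. Shapes and names are assumptions.
import torch

def class_activation_map(feature_maps: torch.Tensor,  # (C, H, W), last conv layer
                         fc_weights: torch.Tensor,    # (num_classes, C), final linear layer
                         class_idx: int) -> torch.Tensor:
    cam = torch.einsum('c,chw->hw', fc_weights[class_idx], feature_maps)
    cam = torch.relu(cam)                    # keep only positive evidence
    return cam / (cam.max() + 1e-8)          # normalise to [0, 1] for overlay
```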
Our work builds on that of (Poudel et al., 2016) and (Cai et al., 2018), treating variable-length stacks of MRI slices as correlated information and processing these data using recurrent convolutional networks. Crucially, we relax the requirement for pixel-level labels by incorporating a hierarchical attention mechanism, first introduced for language modelling (Yang et al., 2016), to exploit the natural hierarchies present in neuroimaging data. In this way, our model is similar to those of (Zhang et al., 2017), (Yan et al., 2019), (Cole et al., 2020), and (Wood et al., 2019), who used visual attention to provide a form of model interpretability for medical image analysis. To our knowledge, however, this is the first demonstration of hierarchical attention for weakly supervised neurological abnormality detection.
Methods
Data
The UK National Health Research Authority and Research Ethics Committee approved this study. All 126,556 adult (≥ 18 years old) MRI head scans performed at KCH between 2008 and 2019 were used in this study. MRI scans were obtained on either a Signa HDx 1.5 T scanner (General Electric Healthcare) or an Aera 1.5 T scanner (Siemens, Erlangen, Germany). Using the ALARM radiology report classifier described in (Wood et al., 2020) and (Wood et al., 2022), each examination was assigned a binary label corresponding to the presence or absence of an abnormality, predicted on the basis of the accompanying free-text neuroradiology report describing the study. The reported classification accuracy of this model is 99.4%, so this labelling procedure is considered reliable. A subset of 600 abnormal examinations was then selected for inclusion using an open-source annotation tool (available at https://github.com/tomvars/sifter; see (Wood et al., 2020)). Because the hospital dataset consisted of MRIs obtained at different stages of the patient pathway (including initial diagnostic imaging, pre-surgical planning, immediate post-surgical assessment, and chemoradiotherapy response assessment over a longer period of follow-up), the abnormalities incorporated were heterogeneous, including tumours at diagnosis, resection cavities after surgery, and treatment-related effects at follow-up (Fig. 1). As such, this dataset provided our model with training abnormalities that varied in size and MRI signal characteristics. A further 600 series were then randomly selected from the large subset labelled as normal, giving a combined balanced dataset of 1,200 examinations.
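As a rough sketch of this assembly step (the file name, column names, and `alarm` wrapper below are hypothetical, and the 600 abnormal examinations were in practice chosen via the annotation tool rather than taken programmatically), the balanced dataset could be constructed as follows:

```python
# Hypothetical sketch of the dataset assembly described above. `alarm` stands
# in for the ALARM report classifier; file and column names are assumptions.
import pandas as pd

exams = pd.read_csv("mri_examinations.csv")            # one row per examination
exams["abnormal"] = [alarm.predict(r) for r in exams["report_text"]]

# 600 abnormal examinations (selected manually via the annotation tool in the
# actual study) plus 600 randomly sampled normal examinations -> 1,200 total.
abnormal = exams[exams["abnormal"] == 1].head(600)
normal = exams[exams["abnormal"] == 0].sample(n=600, random_state=0)
dataset = pd.concat([abnormal, normal]).sample(frac=1, random_state=0)  # shuffle
assert len(dataset) == 1200
```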