Transfer learning is another broadly applicable deep learning technique, in which a number of layers from a pretrained network are retrained for a different use case; this can drastically cut down both the training time and the labelled dataset size needed \cite{ahmed2008training,pan2010survey}. For example, the same transfer learning approach was used both for brain MRI tissue segmentation (gray matter, white matter, and CSF) and for multiple sclerosis lesion segmentation \cite{van2015transfer}. Yet despite these advances, a non-trivial amount of labelled data is still required to train CNNs, which raises the original problem again: how does a single expert create an annotated dataset that is large enough to be amenable to a deep learning approach, for any task requiring expertise?
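A minimal sketch of the idea, assuming a Keras-style API (the base network, layer sizes, and hyperparameters here are illustrative choices, not the configuration used in this study): a pretrained feature extractor is frozen and only a small new classification head is trained, so far fewer labelled examples are needed than when training from scratch.

```python
# Illustrative transfer-learning sketch (not the exact setup used in this study).
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models
from tensorflow.keras.metrics import AUC

# Pretrained ImageNet weights serve as a fixed feature extractor.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary pass/fail output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[AUC()])
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5)
```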
We hypothesized that expert decision making can be scaled up by citizen scientists, who can learn from and amplify expert decisions to the extent that deep learning approaches become feasible. As a proof of concept, we apply this approach to brain MRI quality control (QC): a binary classification task in which images are labelled "pass" or "fail" based on image quality. QC is a paradigmatic example of the problem of scaling expertise: it is subjective, and each researcher has their own standards for which images pass or fail on inspection. This variability of expert subjectivity has problematic effects on downstream analyses, especially statistical inference: effect size estimates depend on the data entered into a statistical model, so varying QC criteria add uncertainty to these estimates and may contribute to replication failures. For example, in \cite{ducharme2016trajectories}, the authors found that QC had a significant impact on their estimates of the trajectory of cortical thickness during development. They concluded that post-processing QC (in the form of visual inspection) is crucial for such studies, especially because of motion artifacts in younger children. It is therefore essential that we develop systems that can accurately emulate expert decisions, and that these systems are made openly available to the scientific community.
For this proof of concept, we developed a citizen-science amplification and CNN procedure for the openly available Healthy Brain Network dataset (HBN; \cite{alexander2017open}). This initiative aims to collect and publicly release data on 10,000 children over the next six years to facilitate the study of brain development and mental health through transdiagnostic research. The rich dataset includes MRI brain scans, EEG and eye-tracking recordings, extensive behavioral testing, genetic sampling, and voice and actigraphy recordings. To understand the relationship between brain structure (based on MRI) and behavior (EEG, eye tracking, voice, actigraphy, behavioral data), or the association between genetics and brain structure, researchers require high-quality MRI data.
In this study, we crowd-amplify image quality ratings and train a CNN on the first and second data releases of the HBN (n=722); the trained network can be used to infer data quality on future data releases. We also demonstrate how the choice of QC threshold affects the effect size estimate for the established association between age and brain tissue volumes during development \cite{Lebel2011}. Finally, we show that our approach of deep learning trained on a crowd-amplified dataset matches state-of-the-art software built specifically for image QC \cite{esteban2017mriqc}. We therefore recommend our crowd-amplification method for any binary image classification task, particularly in cases where specialized, fully automated software does not exist.
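The threshold analysis can be sketched as follows. This is a self-contained illustration on synthetic data (the variable names, sample generation, and thresholds are assumptions for demonstration, not the study's code): scans are filtered at successively stricter QC thresholds, and the age--volume effect size is re-estimated on each retained subset.

```python
# Illustrative sketch: how the QC inclusion threshold can change a downstream
# effect size estimate. All data here are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 722
age = rng.uniform(5, 21, n)                     # simulated ages in years
qc_score = rng.random(n)                        # amplified QC scores in [0, 1]
volume = 800 - 5 * age + rng.normal(0, 40, n)   # simulated tissue volume (cm^3)

for threshold in np.arange(0.1, 1.0, 0.2):
    keep = qc_score >= threshold                # retain only scans passing QC
    r, p = stats.pearsonr(age[keep], volume[keep])
    print(f"threshold={threshold:.1f}  n={keep.sum()}  r={r:.3f}  p={p:.2g}")
```

In real data, where the QC score tracks image quality rather than noise, raising the threshold trades sample size against measurement quality, and the estimated effect size shifts accordingly.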
Results
Overview
The primary goals of this study were to 1) amplify a small, expertly labelled dataset through citizen science, 2) train a CNN on the amplified labels, and 3) evaluate its performance on a validation dataset. Figure \ref{567433} shows an overview of the procedure and provides a summary of our results. At the outset, a group of neuroimaging experts created a gold-standard quality control dataset on a small subset of the data (n=200) through extensive visual examination of the full volumes. In parallel, citizen scientists were asked to "pass" or "fail" two-dimensional axial slices from the full dataset (n=722) through a web application (https://braindr.us) that could be accessed from a desktop, tablet, or mobile phone. Amplified labels, ranging from 0 (fail) to 1 (pass), were generated in two ways: first, by taking the average citizen-scientist rating for each slice; second, by generating a probability score from a gradient-boosting classifier (XGBoost) trained to predict the label of each image, with all of the citizen scientists' ratings as features. A receiver operating characteristic (ROC) curve was generated for both the averaged ratings and the classifier probability outputs, and the area under the curve (AUC) was computed on a validation set: 0.95 for the average crowd ratings and 0.97 for the XGBoost probabilities. Next, a transfer learning regression model was trained to predict the XGBoost probability score from the two-dimensional axial slices; the AUC for the predicted labels on a left-out dataset was 0.99. As a validation, another XGBoost classifier was trained on the features of the MRIQC algorithm, and its AUC on the validation set was also 0.99.
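The amplification step can be sketched roughly as follows (the synthetic data, hyperparameters, and variable names are illustrative assumptions, not the study's code): a gradient-boosting classifier takes the matrix of citizen-scientist votes per slice as features, is fit against the expert gold-standard labels, and its predicted probabilities become the amplified labels.

```python
# Illustrative sketch of label amplification with XGBoost; all data synthetic.
import numpy as np
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
gold = rng.integers(0, 2, size=200)                      # expert pass(1)/fail(0) labels
ratings = gold[:, None] ^ (rng.random((200, 20)) < 0.2)  # noisy citizen votes, one column per rater

X_train, X_val, y_train, y_val = train_test_split(
    ratings, gold, test_size=0.2, random_state=0
)

clf = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
clf.fit(X_train, y_train)

amplified = clf.predict_proba(X_val)[:, 1]      # probability scores in [0, 1]
print("AUC:", roc_auc_score(y_val, amplified))  # the study reports 0.97 on its validation set
```

Using the classifier's probability output rather than a simple vote average lets reliable raters be weighted more heavily, which is consistent with the AUC gain over plain averaging reported above.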