Discussion
We have developed a system to scale expertise in brain MRI quality control. The system uses citizen scientists to amplify an initially small, expert-labelled dataset. Combined with deep learning (via CNNs), the system can then perform image analysis tasks such as quality control (QC). We validated our method against MRIQC, a specialized tool that already exists for this particular use case \cite{Esteban2017}. Unlike MRIQC, our method is able to generalize beyond the classification of T1-weighted images; any image-based binary classification task can be loaded onto the braindr platform and crowdsourced via the web. We have also demonstrated the importance of scaling QC expertise by showing that replication of a previously established result on gray matter volume changes during development \cite{Lebel2011} depends on the researcher's decisions about data quality. In the following sections, we discuss the concepts related to this work in more depth; in particular, we 1) discuss the impact of the internet and web applications on collaboration, 2) review research on MRI quality control and morphometrics over brain development, 3) discuss limitations of our method, and 4) propose future directions.
The Internet and Web Applications for Collaboration
The internet and web browser are crucial not only for scientific communication, but also for collaboration and the distribution of work. Recent citizen science projects in neuroscience research have proven extremely useful and popular, in part due to the ubiquity of the web browser. Large-scale web-based citizen science projects, such as EyeWire \cite{kim2014space,marx2013neuroscience} and Mozak \cite{roskams2016power}, have enabled scientists working with high-resolution microscopy data to map neuronal connections at the microscale with help from over 100,000 citizen scientists. In MR imaging, web-based tools such as BrainBox \cite{heuer2016open} and Mindcontrol \cite{Keshavan2017} were built to facilitate collaboration among neuroimaging experts on image segmentation and quality control. However, inspecting each slice of a 3D image in BrainBox or Mindcontrol is time consuming, and the complexity of the task deters potential citizen scientists who find it too difficult or tedious.
To simplify the task for citizen scientists, we developed a web application called braindr \cite{keshavan2018}, which reduces the time-consuming, slice-by-slice inspection of a 3D volume to a quick decision made on a 2D slice. Using braindr, citizen scientists amplified the initial expert-labelled dataset (200 3D images) to labels for the entire dataset (> 700 3D images, > 3000 2D slices) in a very short time. Because braindr is a lightweight web application that can be played at any time and on any device, we were able to attract many users. On braindr, each slice received on average 20 ratings, so each 3D brain (consisting of 5 slices) received approximately 100 ratings. In short, by redesigning the way we interact with our data and presenting it in the web browser, we were able to get many more eyes on our data than would have been possible in a single research lab.
MRI Quality Control and Morphometrics over Development
Recently, Ducharme and colleagues \cite{Ducharme2016} stressed the importance of quality control for developmental brain morphometry in a large study of 954 subjects. They estimated cortical thickness at each point of a cortical surface and fit linear, quadratic, and cubic models of thickness versus age at each vertex. Quality control was performed by visual inspection of the reconstructed cortical surface, and data that failed QC were removed from the analysis. Without stringent quality control, the best-fit models were more complex (quadratic/cubic); with quality control, the best-fit models were linear. They found sex differences only in occipital regions, which thinned faster in males. In Figure \ref{182176}, we present an interactive chart where users can similarly explore different ordinary least squares models (linear or quadratic), optionally split by sex, for the relationship of total gray matter volume, white matter volume, CSF volume, and total brain volume with age. One possible future direction of this work is to create 2D images of cortical surfaces to QC via the braindr application, in order to replicate the results of Ducharme and colleagues.
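As a concrete illustration of this kind of model comparison, the following minimal sketch fits linear and quadratic OLS models of gray matter volume against age, split by sex, and compares them by AIC. The column names and file name are hypothetical, and this is not the code behind the interactive chart.

\begin{verbatim}
# Minimal sketch (not the paper's code): compare linear vs. quadratic OLS fits
# of gray matter volume against age, split by sex. Column names are assumed.
import pandas as pd
import statsmodels.formula.api as smf

def compare_fits(df: pd.DataFrame) -> None:
    for sex, group in df.groupby("sex"):
        linear = smf.ols("gm_vol ~ age", data=group).fit()
        quadratic = smf.ols("gm_vol ~ age + I(age ** 2)", data=group).fit()
        # A lower AIC favours that model; similar AICs favour the simpler fit.
        print(sex, "linear AIC:", round(linear.aic, 1),
              "quadratic AIC:", round(quadratic.aic, 1))

# Hypothetical usage with a phenotype table containing age, sex, gm_vol:
# compare_fits(pd.read_csv("hbn_volumes.csv"))
\end{verbatim}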
We chose to QC the raw MRI data in this study, rather than the processed data, because the quality of the raw data affects the downstream cortical mesh generation and many other computed metrics. A large body of research on automated QC of T1-weighted images exists, in part because of large open data sharing initiatives. In 2009, Mortamet and colleagues \cite{mortamet2009automatic} developed a QC algorithm based on the background of magnitude images in the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, and reported a sensitivity and specificity of > 85%. In 2015, Shehzad and colleagues \cite{shehzadpreprocessed} developed the Preprocessed Connectomes Project Quality Assessment Protocol (PCP-QAP) on the Autism Brain Imaging Data Exchange (ABIDE) and Consortium for Reproducibility and Reliability (CoRR) datasets. The PCP-QAP also included a Python library to easily compute metrics such as signal-to-noise ratio, contrast-to-noise ratio, entropy focus criterion, foreground-to-background energy ratio, voxel smoothness, and the percentage of artifact voxels. Building on this work, the MRIQC package from Esteban and colleagues \cite{Esteban2017a} includes a comprehensive set of 64 image quality metrics, from which a classifier was trained to predict the data quality of the ABIDE dataset for new, unseen sites with 76% accuracy.
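To give a sense of what such metrics capture, the following simplified sketch computes crude versions of two of them (signal-to-noise ratio and foreground-to-background energy ratio) from a T1-weighted volume. MRIQC's actual implementations use more careful masking and definitions, and the file name here is hypothetical.

\begin{verbatim}
# Simplified sketch of two quality metrics; not MRIQC's implementation.
import nibabel as nib
import numpy as np

data = nib.load("sub-01_T1w.nii.gz").get_fdata()   # hypothetical file

# Crude foreground/background split by an intensity threshold.
threshold = np.percentile(data, 50)
foreground = data[data > threshold]
background = data[data <= threshold]

snr = foreground.mean() / (background.std() + 1e-6)
fber = (foreground ** 2).mean() / ((background ** 2).mean() + 1e-6)
print(f"SNR ~ {snr:.2f}, FBER ~ {fber:.2f}")
\end{verbatim}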
Our strategy differed from that of the MRIQC classification study. In \cite{Esteban2017a}, the authors labelled images that were "doubtful" in quality as a "pass" when training and evaluating their classifier. Our MRIQC classifier was trained and evaluated only on images that our raters very confidently passed or failed. Because quality control is subjective, we felt that it was acceptable for a "doubtful" image to be failed by the classifier. Since our classifier was trained on data acquired at a single site, and only on images that we were confident about, it achieved near-perfect accuracy, with an AUC of 0.99. Our braindr CNN, on the other hand, was trained as a regression (rather than a classification) on the full dataset, including the "doubtful" images (i.e., those with ratings closer to 0.5), but was still evaluated as a classifier against the data we were confident about; it also achieved near-perfect accuracy, with an AUC of 0.99. Because both the MRIQC and braindr classifiers perform so well on data we are confident about, we contend that it is acceptable to let the classifier act as a "tie-breaker" for images that lie in the middle of the spectrum, for all future acquisitions of the HBN dataset.
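The evaluation scheme described above (continuous scores evaluated as a classifier against confidently labelled images only) can be sketched as follows; the scores and labels here are illustrative, not the study's data.

\begin{verbatim}
# Sketch: evaluate continuous quality scores as a classifier, restricted to
# images with confident expert labels (np.nan marks "doubtful" images).
import numpy as np
from sklearn.metrics import roc_auc_score

scores = np.array([0.95, 0.10, 0.80, 0.55, 0.05, 0.40])   # model outputs
gold = np.array([1.0, 0.0, 1.0, np.nan, 0.0, np.nan])     # expert labels

confident = ~np.isnan(gold)                 # keep only confident labels
auc = roc_auc_score(gold[confident], scores[confident])
print(f"AUC on confident subset = {auc:.2f}")
\end{verbatim}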
Limitations
One limitation of this method is the tradeoff between interpretability and speed. Specialized QC tools were developed over many years, while this study was performed in a few months. Specialized QC tools are far more interpretable; for example, the coefficient of joint variation (CJV) metric from MRIQC is sensitive to the presence of head motion. CJV was one of the most important features of our MRIQC classifier, implying that our raters were primarily sensitive to motion artifacts. Conclusions of this kind are difficult to draw from the braindr CNN. Because we employed transfer learning, the extracted features were based on the ImageNet classification task, and it is unclear how these features relate to MRI-specific artifacts. However, interpretability of deep learning is an active field of research \cite{chakraborty2017interpretability}, and we may be able to fit more interpretable models in the future.
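For readers unfamiliar with transfer learning, the following is a minimal sketch of the general approach, assuming a Keras/TensorFlow setup: a network pretrained on ImageNet is used as a fixed feature extractor, and a small head regresses a quality score for each 2D slice. It illustrates the technique in general and is not the exact architecture or training code of the braindr CNN.

\begin{verbatim}
# Minimal transfer-learning sketch (illustrative, not the braindr CNN).
import tensorflow as tf

base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3))
base.trainable = False                       # keep ImageNet features fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # quality score in [0, 1]
])
model.compile(optimizer="adam", loss="mse")  # trained as a regression
\end{verbatim}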
Compared to previous efforts to train models that predict quality ratings, such as MRIQC \cite{Esteban2017a}, our AUC scores are very high. There are two main reasons for this. First, in \cite{Esteban2017a}, the authors tried to predict the quality of scans from unseen sites, whereas in our study we combined data across all sites. Second, even though our quality ratings on the 3D dataset were continuous scores (ranging from -5 to 5), we only evaluated the performance of our models on data that received an extremely high (4 or 5) or extremely low (-4 or -5) score. This was because quality control is subjective, so there is more variability in the ratings of images that people are unsure about: an image failed with low confidence (-3 to -1) by one researcher could conceivably be passed with low confidence (1 to 3) by another. Most importantly, our study had enough data to drop the images in this "gray area" and still train our XGBoosted model on both the braindr ratings and the MRIQC features. In studies with less data, such an approach might not be feasible.
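The "gray area" filtering described above can be sketched as follows; the data frame columns and the example scores are illustrative.

\begin{verbatim}
# Sketch of dropping "gray area" images: keep only images with extreme scores
# (|score| >= 4 on the -5..5 scale) and binarize them. Columns are assumed.
import pandas as pd

ratings = pd.DataFrame({
    "image": ["a", "b", "c", "d"],
    "score": [5, -4, 2, -1],     # -5 = confident fail, 5 = confident pass
})

confident = ratings[ratings["score"].abs() >= 4].copy()
confident["label"] = (confident["score"] > 0).astype(int)   # 1 = pass
# Images with |score| < 4 are excluded from training and evaluation.
\end{verbatim}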
Another limitation of this method was that our citizen scientists were primarily neuroscientists. The braindr application was advertised on Twitter (https://www.twitter.com) by the authors, whose social networks consist primarily of neuroscientists. As the original tweet travelled outside our social network, we saw more citizen scientists without experience looking at brain images on the platform, but they contributed fewer ratings than users with neuroscience experience. We also observed an overall tendency for users to incorrectly pass images. Future iterations of braindr will include a more informative tutorial and random checks with known images throughout the game, to make sure our players are well informed and performing well throughout the task. In this study, we were able to overcome this limitation because we had enough ratings to train the XGBoost algorithm to preferentially weight some users' ratings over others, as sketched below.
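The sketch below shows how a gradient-boosted model can learn to weight raters, assuming a matrix in which each column holds one citizen scientist's ratings per image and the target is the gold-standard label. The data here are synthetic and the setup is illustrative rather than the exact braindr pipeline.

\begin{verbatim}
# Illustrative sketch: learn per-rater weights with XGBoost on synthetic data.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
n_images, n_raters = 200, 10
X = rng.random((n_images, n_raters))     # column j = rater j's ratings
y = (X[:, 0] > 0.5).astype(int)          # toy gold labels; rater 0 is reliable

model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
print(model.feature_importances_)        # reliable raters get higher weight
\end{verbatim}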
Future Directions
Citizen science platforms like the Zooniverse \cite{simpson2014zooniverse} enable researchers to upload tasks and engage over 1 million citizen scientists. We plan to integrate braindr into a citizen science platform like the Zooniverse. This would enable researchers to upload their own data to braindr and give them access to a diverse group of citizen scientists, rather than only the neuroscientists within their social networks. We also plan to reuse the braindr interface for more complicated classification tasks in brain imaging, such as the classification of ICA components as signal or noise \cite{griffanti2017hand} or the evaluation of segmentation algorithms. Finally, integrating braindr with existing open data initiatives, like OpenNeuro \cite{gorgolewski2017openneuro}, or existing neuroimaging platforms, like LORIS \cite{das2012loris}, would enable scientists to launch braindr tasks directly from these platforms, seamlessly incorporating human-in-the-loop data analysis into neuroimaging research.
Methods
The Healthy Brain Network Dataset
The first two releases of the Healthy Brain Network dataset were downloaded from http://fcon_1000.projects.nitrc.org/indi/cmi_healthy_brain_network/sharing_neuro.html. A web application for brain quality control, called Mindcontrol \cite{Keshavan2017}, was hosted at https://mindcontrol-hbn.herokuapp.com, which enabled users to view and rate 3D MRI images in the browser. There were 724 T1-weighted images. All procedures were approved by the University of Washington Institutional Review Board (IRB). Mindcontrol raters provided informed consent, including consent to publicly release their ratings. Raters were asked to pass or fail images after inspecting the full 3D volume, and to provide a confidence score on a 5-point Likert scale, where 1 was the least confident and 5 was the most confident. Raters received a point for each new volume they rated, and a leaderboard on the homepage displayed rater rankings. The ratings of the top 4 raters (including the lead author) were used to create a "gold standard" subset of the data.
"Gold" Standard Selection
The "gold" standard subset of the data was created by selecting images that were confidently passed (confidence > 4) or confidently failed (confidence < -4) by the 4 raters. In order to measure reliability between raters, the ratings of the second, third, and fourth rater were recoded to a scale of -5 to 5 (where -5 is confidently failed, and 5 is confidently passed). An ROC analysis was performed against the binary ratings of the lead author on the commonly rated images, and the area under the curve (AUC) was computed for each pair. An average AUC, weighted by the number of commonly rated images between the pair, was 0.97, showing good agreement between raters. The resulting "gold" standard dataset consisted of 200 images. Figure \ref{169530} shows example axial slices from the "gold" standard dataset.