Deep learning to predict image QC label

Finally, a deep learning model was trained on the brain slices to predict the XGBoost probability score. All brain slices were resized to 256 by 256 pixels and converted to three color channels (RGB). The data was split into 80%-10%-10% training-validation-test sets, with all slices belonging to the same subject grouped into the same set so that there was no spillover across the training, validation, and test sets. We loaded the pretrained VGG16 network \cite{simonyan2014very} implemented in Keras \cite{chollet2015keras}, removed the top layer, and ran inference on all the data. The output of the VGG16 inference was then used to train a small sequential neural network consisting of a dense layer with 256 nodes and a rectified linear unit (ReLU) activation function, followed by a dropout layer set to drop 50% of the weights to prevent overfitting, and finally a single-node output layer with sigmoid activation. This final network was trained for 50 epochs, and the model that performed best on the validation set across the 50 epochs was saved. We ran this model 10 separate times, each time with a different random initialization seed, in order to measure the variability of our ROC AUC on the test set.
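
A minimal sketch of this transfer-learning setup is shown below. The array names (X_train, y_train, X_val, y_val) are hypothetical placeholders for the subject-grouped splits, and the optimizer and loss are our assumptions, since the text does not specify them:

\begin{verbatim}
# Sketch of the VGG16 feature extraction and small trainable head.
# Assumes slices are already resized to 256x256 RGB and stored in
# hypothetical arrays X_train, y_train, X_val, y_val.
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout
from keras.callbacks import ModelCheckpoint

# Load VGG16 without its classification head and extract features once.
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(256, 256, 3))
feat_train = base.predict(preprocess_input(X_train.astype("float32")))
feat_val = base.predict(preprocess_input(X_val.astype("float32")))

# Small head: 256-unit ReLU dense layer, 50% dropout, and a single
# sigmoid output predicting the XGBoost probability score.
head = Sequential([
    Flatten(input_shape=feat_train.shape[1:]),
    Dense(256, activation="relu"),
    Dropout(0.5),
    Dense(1, activation="sigmoid"),
])
# Optimizer and loss are assumptions, not specified in the text.
head.compile(optimizer="adam", loss="binary_crossentropy")

# Keep only the weights that perform best on the validation set.
checkpoint = ModelCheckpoint("best_head.h5", monitor="val_loss",
                             save_best_only=True)
head.fit(feat_train, y_train, epochs=50,
         validation_data=(feat_val, y_val), callbacks=[checkpoint])
\end{verbatim}

Extracting the VGG16 features once up front, rather than fine-tuning the full network, keeps the trainable parameter count small, which suits the modest size of the labeled slice dataset.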

Training the MRIQC model

MRIQC was run on all images in the HBN dataset. The extracted QC features were used to train another XGBoost classifier to predict the gold-standard labels. Two thirds of the data was used to train the model, with 2-fold cross-validation used to optimize the hyperparameters: learning rate (0.001, 0.01, 0.1), number of estimators (200, 600), and maximum depth (2, 6, 8). An ROC analysis was run on the held-out third of the data, and the computed area under the curve was 0.99.
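
A sketch of this cross-validated grid search, assuming the MRIQC features and gold-standard labels live in hypothetical arrays X and y, might look as follows with scikit-learn and xgboost:

\begin{verbatim}
# Illustrative sketch of the MRIQC-feature classifier; X and y are
# hypothetical stand-ins for the extracted QC features and labels.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Two thirds of the data for training, the rest held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2 / 3, stratify=y, random_state=0)

# 2-fold cross-validated grid search over the hyperparameter values
# listed in the text.
param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_estimators": [200, 600],
    "max_depth": [2, 6, 8],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=2,
                      scoring="roc_auc")
search.fit(X_train, y_train)

# ROC analysis on the held-out third of the data.
scores = search.predict_proba(X_test)[:, 1]
print("test ROC AUC:", roc_auc_score(y_test, scores))
\end{verbatim}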

Gray matter volume vs. age during development

Finally, to explore the relationship between gray matter volume and age over development as a function of QC threshold, gray matter volume was computed by running the Mindboggle software \cite{klein2017mindboggling} on the entire dataset. Extremely low-quality scans did not make it through the full Mindboggle pipeline, reducing the dataset to 629 subjects for this part of the analysis. The final QC score for each brain volume was computed by averaging the predicted braindr ratings from the deep learning model over all five slices. We fit an ordinary least squares (OLS) model of gray matter volume versus age on the data with and without QC thresholding, with the QC threshold set at 0.7. Figure \ref{182176} shows the result of this analysis: when QC was applied, the effect size nearly doubled and previous findings were replicated.
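
For illustration, the thresholded and unthresholded fits could be run with statsmodels as below; the DataFrame df and its column names are hypothetical stand-ins for the per-subject table of Mindboggle gray matter volumes, ages, and slice-averaged braindr scores:

\begin{verbatim}
# Sketch of the QC-thresholded OLS analysis. Assumes a hypothetical
# pandas DataFrame `df` with one row per subject and columns
# 'gm_volume', 'age', and 'braindr_score' (the mean predicted braindr
# rating over the five slices of each volume).
import statsmodels.formula.api as smf

def fit_gm_vs_age(data):
    """OLS of gray matter volume on age; returns the fitted model."""
    return smf.ols("gm_volume ~ age", data=data).fit()

full_fit = fit_gm_vs_age(df)                           # no QC threshold
qc_fit = fit_gm_vs_age(df[df["braindr_score"] > 0.7])  # threshold at 0.7

# Compare the age effect (slope) with and without QC.
print("age slope, all data:", full_fit.params["age"])
print("age slope, QC > 0.7:", qc_fit.params["age"])
\end{verbatim}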

Acknowledgements

This research was supported through a grant from the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation to the University of Washington eScience Institute. A.K. is also supported through a fellowship from the eScience Institute and the University of Washington Institute for Neuroengineering. We'd like to acknowledge the following people for fruitful discussions and contributions to the project: Dylan Nielson, Satra Ghosh, and Dave Kennedy, for the inspiration for braindr; Greg Kiar, for contributing badges to the braindr application; Chris Markiewicz, for discussions on application performance and for application testing in the early stages; Katie Bottenhorn, Dave Kennedy, and Amanda Easson, for quality controlling the gold standard dataset; Jamie Hanson, for sharing the MRIQC metrics; Chris Madan, for application testing and for discussions regarding QC standards; Arno Klein and Lei Ai, for providing the segmented images from the HBN dataset; and Tal Yarkoni and Alejandro de la Vega, for organizing a "code rodeo" for neuroimagers in Austin, TX, where the idea for braindr was born. Finally, we'd like to thank all the citizen scientists who swiped on braindr: we are very grateful for your contributions!