Rating aggregation with XGBoost

The distribution in Figure \ref{358654}A was bimodal, as expected, but the left peak (corresponding to failed slices) was shifted to the right. Because these were gold standard images, selected because neuroimaging experts had confidently passed or failed them, we expected the left peak of the bimodal distribution to lie at 0. This implied that some users were incorrectly passing images. To select the users whose ratings were most similar to those of the gold standard raters, we trained an XGBoost classifier \cite{chen2016xgboost} implemented in Python (http://xgboost.readthedocs.io/en/latest/python/python_intro.html), using the cross-validation functions from the scikit-learn Python library \cite{pedregosa2011scikit}. We used 600 estimators and performed a grid search with stratified 10-fold cross-validation within the training set to select the optimal maximum depth (2 vs. 6) and learning rate (0.01 vs. 0.1). Each slice was an observation, and the features were the braindr players, with each feature value given by that player's average rating of the slice. We trained the classifier on splits of the data of various sizes to test its dependence on training-set size (see Figure \ref{468392}A). We then used the model trained with n=670 to extract the classifier's probability scores for all 3609 slices in braindr (see Figure \ref{468392}B). The distribution of probability scores in Figure \ref{468392}B better matches our expectations of the data: a bimodal distribution with peaks at 0 and 1. Feature importances were extracted from the model and plotted in Figure \ref{468392}C, and plotted against each player's total number of gold standard image ratings in Figure \ref{468392}D.
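
A minimal sketch of this grid-searched training step, under stated assumptions, is given below. The arrays ratings_matrix and gold_labels are hypothetical placeholders (random data standing in for the gold standard slices); in the actual analysis, each row is a slice, each column is a braindr player, and each entry is that player's average rating of the slice.

\begin{verbatim}
# Sketch of the grid-searched XGBoost training described above.
# ratings_matrix and gold_labels are hypothetical placeholders: rows are
# slices (observations), columns are braindr players, and each entry is
# that player's average rating of the slice; labels are expert pass/fail.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)

rng = np.random.default_rng(0)
ratings_matrix = rng.uniform(0, 1, size=(800, 50))  # placeholder ratings
gold_labels = rng.integers(0, 2, size=800)          # placeholder labels

# Hold out part of the data; the text reports a model trained with n=670.
X_train, X_test, y_train, y_test = train_test_split(
    ratings_matrix, gold_labels, train_size=670,
    stratify=gold_labels, random_state=0)

# 600 estimators; grid search over maximum depth (2 vs. 6) and learning
# rate (0.01 vs. 0.1) with stratified 10-fold cross-validation on the
# training set.
param_grid = {"max_depth": [2, 6], "learning_rate": [0.01, 0.1]}
search = GridSearchCV(
    xgb.XGBClassifier(n_estimators=600),
    param_grid,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
search.fit(X_train, y_train)
model = search.best_estimator_
\end{verbatim}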
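
Continuing the same hypothetical sketch (reusing model and rng from above), the probability scores for all slices and the per-player feature importances could be read out as follows; all_slices_matrix and player_ids are again placeholders rather than the actual braindr data.

\begin{verbatim}
# Continuation of the sketch above; all_slices_matrix and player_ids are
# hypothetical placeholders for the full 3609-slice rating matrix and the
# braindr player names (one per feature column).
all_slices_matrix = rng.uniform(0, 1, size=(3609, 50))
player_ids = [f"player_{i}" for i in range(ratings_matrix.shape[1])]

# Probability that each slice passes, according to the trained classifier;
# these scores form the bimodal distribution described in the text.
slice_scores = model.predict_proba(all_slices_matrix)[:, 1]

# Per-player feature importances: how much weight the model places on each
# player's ratings when predicting the gold standard label.
importances = dict(zip(player_ids, model.feature_importances_))
top_players = sorted(importances, key=importances.get, reverse=True)[:10]
\end{verbatim}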