Aggregating Citizen Scientist Ratings to Emulate Expert Labels

We used ratings of brain slice images by citizen scientists to increase the size of the training set. Citizen scientists who rated images through the interactive web application differed substantially in how well their ratings matched the experts' ratings: while some agreed with the experts most of the time, others disagreed with them on a substantial portion of the brain slices they rated. To create image labels that more closely match expert opinion, we assigned a weight to each citizen scientist based on their agreement with the experts on slices from the gold-standard set. We used the XGBoost algorithm \cite{Chen2016}, an ensemble method that combines a set of weak learners (decision trees) to predict the gold-standard labels from a set of features. In our case, the features were each rater's average rating of a slice image (some images were viewed and rated more than once, so a rater's average rating of an image could vary between 1 = always "pass" and 0 = always "fail"). We then used the trained XGBoost model to predict labels on the left-out test set, taking the probability scores of the model's classification as the "XGBoosted" labels.

Figure \ref{468392}A shows ROC curves on the left-out test set for different training set sizes, compared to the ROC curve of the average rating. We see a slight improvement in the AUC of the XGBoosted labels (0.97) compared to the AUC of the average labels (0.95). Using the model trained on two-thirds of the data (n=670), we extracted the probability scores of the classifier on all slices in braindr (see Figure \ref{468392}B). The distribution of probability scores in Figure \ref{468392}B matches our expectations of the data: a bimodal distribution with peaks at 0 and 1. The XGBoost model also calculates a feature importance score (F): the number of times a feature (in our case, a rater) has split the branches of a tree, summed over all boosted trees. Figure \ref{468392}C shows the feature importance of each rater, and Figure \ref{468392}D shows the relationship between a rater's importance and the number of images they rated. In general, the more images a rater scored, the more important they were to the model. However, there were a few exceptions: some raters scored many images but were often incorrect, so the model gave their ratings less weight during aggregation.
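As a concrete illustration, the aggregation and weighting procedure can be sketched in Python roughly as follows. The sketch assumes a pandas DataFrame \texttt{ratings} with columns \texttt{slice\_id}, \texttt{rater\_id}, and \texttt{rating} (0/1 per view), and a dict \texttt{gold} mapping gold-standard slice IDs to expert labels; these names, and the hyperparameter values, are illustrative assumptions rather than the braindr codebase.

\begin{verbatim}
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Feature matrix: one row per slice, one column per rater, holding
# that rater's average rating of the slice (between 0 and 1).
# Entries are NaN where a rater never saw a slice; XGBoost handles
# missing values natively.
X = ratings.pivot_table(index="slice_id", columns="rater_id",
                        values="rating", aggfunc="mean")

# Train on the gold-standard slices, holding out one third for testing.
gold_ids = [s for s in X.index if s in gold]
y = np.array([gold[s] for s in gold_ids])
X_train, X_test, y_train, y_test = train_test_split(
    X.loc[gold_ids], y, test_size=1/3, random_state=0)

clf = xgb.XGBClassifier(n_estimators=100, max_depth=3)  # illustrative
clf.fit(X_train, y_train)

# Probability scores of the classifier serve as the "XGBoosted"
# labels; compare their AUC against the plain average rating.
p_xgb = clf.predict_proba(X_test)[:, 1]
p_avg = X_test.mean(axis=1)          # pandas skips NaNs by default
print("AUC, XGBoosted labels:", roc_auc_score(y_test, p_xgb))
print("AUC, average rating:  ", roc_auc_score(y_test, p_avg))

# Apply the trained model to every slice to obtain a probability
# score per image.
scores = pd.Series(clf.predict_proba(X)[:, 1], index=X.index)

# Feature importance F: the number of times each rater's column was
# used to split a tree node, summed over all boosted trees.
F = clf.get_booster().get_score(importance_type="weight")
\end{verbatim}

In this formulation a rater is never weighted explicitly; their weight emerges from how often the boosted trees find their ratings useful for predicting the gold-standard labels, which is exactly what the F score in Figure \ref{468392}C summarizes.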