We use the area under the receiver-operating-characteristic curve (AUC) as our evaluation metric. The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings. The area under this curve can be interpreted as the probability that our model gives a higher score to a randomly chosen positive example than to a randomly chosen negative example.

Our experimental parameters were the set of features used and the type of classifier. We tested a range of feature-subset configurations (indicated below) and compared a logistic-regression (logReg) model against one based on random forests (rF). The hyperparameters of the logReg model were the penalty norm (\(\ell^1\), \(\ell^2\) or a mixed norm) and the regularisation parameter. For the rF model we optimised over maximum tree depth.

Our best AUC score of `0.84` was achieved by an rF model trained on the full set of features **(Fig. ?)**. We include the full ROC curve for this configuration **(Fig. ?)**. In fact, rF models uniformly outperformed their logReg counterparts. rF models were also particularly good at consolidating the different features: in contrast to the logReg model, adding a feature to the rF model never decreased performance. The logReg model also made particularly poor use of the 'freshness/recency' feature, which was noisy and contained several large outliers. As rF models are highly robust to such features, this finding is unsurprising **(Fig. ?)**.

* Reasonable performance, but the model is very simplistic.
* Combining articles that are independently likely to be useful gives no guarantee about the overall quality or coverage of the set.
* We do not directly model the diversity of articles.
* We do not account for the temporal aspect.
* BUT: the binary version of the model is a useful component in a more structured model.
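
As a concrete illustration of the AUC interpretation given above (the probability that a random positive example out-scores a random negative one), the following minimal sketch checks that interpretation against scikit-learn's `roc_auc_score` on purely synthetic labels and scores; none of the values correspond to our actual data.

```python
# Minimal sketch: AUC as the probability that a randomly chosen positive
# example receives a higher score than a randomly chosen negative example.
# Labels and scores here are synthetic and purely illustrative.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)            # binary labels
scores = rng.normal(loc=y_true, scale=1.0)        # noisy scores, higher on average for positives

# Pairwise interpretation: fraction of (positive, negative) pairs in which
# the positive example gets the higher score.
pos, neg = scores[y_true == 1], scores[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(roc_auc_score(y_true, scores), pairwise)    # the two quantities agree (no ties)
```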
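The model comparison described above can likewise be sketched as a cross-validated grid search scored by AUC. This is only an outline under assumed settings: the feature matrix is a synthetic stand-in, and the penalty, regularisation and tree-depth grids are illustrative rather than the values we actually searched.

```python
# Sketch of the experimental setup: tune a logistic-regression (logReg) and a
# random-forest (rF) classifier by cross-validated grid search, scored by AUC.
# The data and grids below are illustrative stand-ins, not our real configuration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in for our feature matrix and labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# logReg: search over penalty norm and regularisation strength.
# (A mixed/elastic-net norm would need solver="saga" with an l1_ratio grid.)
log_reg = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]},
    scoring="roc_auc",
    cv=5,
)

# rF: search over maximum tree depth.
rf = GridSearchCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    param_grid={"max_depth": [2, 4, 8, 16, None]},
    scoring="roc_auc",
    cv=5,
)

log_reg.fit(X, y)
rf.fit(X, y)
print(log_reg.best_score_, rf.best_score_)  # cross-validated AUC of each tuned model
```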