## Baseline Validation

The backbone of our baseline model is a binary classifier. We first analyse the performance of this classifier in isolation: by ensuring that the atomic elements of the baseline model perform well, we seek to argue that the model is a reasonable benchmark.

### Dataset and Experimental Design

We take one thousand briefs from August 2015, five hundred from November 2015 and five hundred from December 2015. These constitute our training, tuning and test sets respectively. We constructed a family of classifiers on the training set with different hyper-parameter configurations, compared them by their performance on the tuning set, and evaluated the strongest model on the test set. This test-set score is the one presented in all results below.

It was also important that our train/tune/test splits account for our dataset's inherent temporality. We did not opt for a naive split in which we simply sample from a pool of briefs, ignoring the temporal aspect. The first reason is obvious: when we make predictions, we want to condition on the past and predict the present or future. If we randomly shuffled a collection of articles, some splits would violate this assumption. Respecting temporal order matters both for the scientific rigour of our evaluation and because it more closely resembles real usage patterns.

Another, more subtle factor is the strong intra-document dependency structure: in a given time period, a number of articles appeared consistently in many briefs. We found that a model built on month \(x\) and evaluated on month \(x+1\) scored significantly higher than the same model evaluated on, say, month \(x+4\). This seemed to indicate that our one-month classifier considered the particulars of specific articles, not an article's 'brief-worthiness' in the abstract. This would be a concern if we wished to use a larger range of historical data in constructing our model; from an implementation perspective, it is also beneficial to be able to update our classifier less frequently. As such, we include the \(x+4\) evaluation measure.

We use the area under the receiver-operating-characteristic curve (AUC) as our evaluation metric. The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings. The area under this curve can be interpreted as the probability that our model gives a higher score to a randomly chosen positive example than to a randomly chosen negative example.

Our experimental parameters were the set of features used as well as the type of classifier. We tested a range of feature-subset configurations (indicated below) and compared a logistic-regression (logReg) model against one based on random forests (rF). The hyperparameters of the logReg model were the penalty metric (\(l^1\)-, \(l^2\)- or mixed-norm) and the regularisation parameter. In our rF model we optimised over maximum tree depth.
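To make the temporal split concrete, the following is a minimal sketch, assuming the briefs live in a pandas DataFrame with a `published` timestamp and a binary `in_brief` label; the file name and column names are hypothetical, not part of our pipeline.

```python
import pandas as pd

# Hypothetical schema: one row per candidate article, with a 'published'
# timestamp and a binary 'in_brief' label (1 = the article appeared in a brief).
briefs = pd.read_csv("briefs.csv", parse_dates=["published"])

def month_slice(df, year, month, n):
    """Sample up to n rows whose 'published' date falls in the given calendar month."""
    mask = (df["published"].dt.year == year) & (df["published"].dt.month == month)
    subset = df[mask]
    return subset.sample(n=min(n, len(subset)), random_state=0)

# Temporal split: train on August 2015, tune on November 2015, test on December 2015.
# We never shuffle across months, so the model always conditions on the past.
train = month_slice(briefs, 2015, 8, 1000)
tune = month_slice(briefs, 2015, 11, 500)
test = month_slice(briefs, 2015, 12, 500)
```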
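The equivalence between the area under the ROC curve and this pairwise-ranking probability can be checked directly. The snippet below does so on synthetic labels and scores; it is purely illustrative and not part of our evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)                # synthetic binary labels
y_score = y_true * 0.5 + rng.normal(0, 1, size=2000)  # noisy scores, higher for positives

# AUC computed from the ROC curve.
auc = roc_auc_score(y_true, y_score)

# The same quantity as the probability that a randomly chosen positive outscores
# a randomly chosen negative (ties counted as half), estimated over all pairs.
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
greater = (pos[:, None] > neg[None, :]).mean()
ties = (pos[:, None] == neg[None, :]).mean()
pairwise = greater + 0.5 * ties

print(f"roc_auc_score: {auc:.4f}, pairwise estimate: {pairwise:.4f}")  # the two agree
```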
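As a sketch of the selection procedure (not our exact implementation), the code below fits each candidate configuration on the training month, compares the candidates by AUC on the tuning month and reports the strongest on the test month. The feature names and hyperparameter grids are illustrative only, and the `train`/`tune`/`test` frames are those from the split sketch above.

```python
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Assumed: feature columns extracted from the temporal splits above (names hypothetical).
feature_cols = ["n_shares", "freshness", "source_rank"]
X_train, y_train = train[feature_cols].values, train["in_brief"].values
X_tune, y_tune = tune[feature_cols].values, tune["in_brief"].values
X_test, y_test = test[feature_cols].values, test["in_brief"].values

candidates = []

# Logistic regression: penalty metric (l1, l2 or mixed norm) and regularisation strength.
for penalty, C in product(["l1", "l2", "elasticnet"], [0.01, 0.1, 1.0, 10.0]):
    kwargs = {"l1_ratio": 0.5} if penalty == "elasticnet" else {}
    candidates.append(LogisticRegression(penalty=penalty, C=C, solver="saga",
                                         max_iter=5000, **kwargs))

# Random forest: maximum tree depth.
for depth in [2, 4, 8, 16, None]:
    candidates.append(RandomForestClassifier(max_depth=depth, n_estimators=200,
                                             random_state=0))

# Fit on the training month and compare on the tuning month by AUC...
best_model, best_auc = None, -1.0
for model in candidates:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_tune, model.predict_proba(X_tune)[:, 1])
    if auc > best_auc:
        best_model, best_auc = model, auc

# ...then report only the strongest configuration on the held-out test month.
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(f"tuning AUC: {best_auc:.3f}, test AUC: {test_auc:.3f}")
```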
### Results

We see that our best AUC score of `0.84` used an rF model trained on the full set of features **(Fig. ?)**. We include the full ROC curve for this configuration **(Fig. ?)**. In fact, rF models outperformed their logReg counterparts uniformly. Additionally, rF models were particularly good at consolidating the different features: in contrast to the logReg model, adding a feature to the rF model never decreased performance. The logReg model also made particularly poor use of the 'freshness/recency' feature, which was noisy with several large outliers. As rF models are highly robust to this kind of noise, we are unsurprised by this finding **(Fig. ?)**.