Xavier Holt added The_backbone_of_our_baseline__.md  almost 8 years ago

Commit id: d86a7d8226393b5c4a1d7d797c0f97d3a2d7304c

The backbone of our baseline model is a binary classifier. We first analyse the performance of this classifier in isolation. By ensuring that the atomic elements of the baseline model perform well, we seek to argue that the model is a reasonable benchmark.

### Dataset and Experimental Design

We take one thousand briefs from August 2015, five hundred from November 2015 and five hundred from December 2015. These constitute our training, tuning and test sets respectively. We constructed a family of classifiers on our training set using different hyper-parameter configurations. These were compared by their performance on the tuning set, and the strongest model was evaluated on the test set. This is the score presented in all results below.

It was also important that our train/tune/test splits account for our dataset's inherent temporality. We did not opt for a naive split in which we simply sample from a pool of briefs, ignoring the temporal aspect. The first reason is obvious: when we make predictions, we want to condition on the past and predict the present or future. If we randomly shuffled a collection of articles, there would be cases where these assumptions are broken. This matters both for the scientific rigour of our evaluation and because it more closely resembles real usage patterns.

Another, more subtle factor is the strong dependencies between documents: in a given time period, a number of articles appeared consistently in many briefs. We found that when we built a model on month \(x\) and evaluated it on month \(x+1\), we scored significantly higher than when evaluating on, say, month \(x+4\). This seemed to indicate that our one-month classifier was learning the particulars of specific articles, not an article's 'brief-worthiness' in the abstract. This is a concern if we wish to use a larger range of historical data in the construction of our model. From an implementation perspective, it is also beneficial to be able to update our classifier less frequently. As such, we include the \(x+4\) evaluation measure.

We use the area under the receiver operating characteristic curve (AUC) as our evaluation metric. The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings. The area under this curve can be interpreted as the probability that our model gives a higher score to a randomly chosen positive example than to a randomly chosen negative example.

Our experimental parameters were the set of features used as well as the type of classifier. We tested a range of subset configurations (indicated below) and compared a logistic regression (logReg) model against one based on random forests (rF). The hyperparameters of the logReg model were the penalty norm (\(\ell^1\), \(\ell^2\) or a mixture) and the regularisation parameter. In our rF model we optimised over maximum tree depth.
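To make the temporal split described above concrete, the following is a minimal sketch of partitioning by calendar month rather than shuffling randomly. The file name and column layout are assumptions for illustration, not the project's actual storage format:

```python
import pandas as pd

# Hypothetical layout: one row per candidate example, with a `date` column
# giving the brief's publication date (file and column names are assumed).
briefs = pd.read_csv("briefs.csv", parse_dates=["date"])
briefs["month"] = briefs["date"].dt.to_period("M")

# Split strictly by calendar month so that every model conditions on the past
# and predicts the future; no random shuffling across the time axis.
train = briefs[briefs["month"] == pd.Period("2015-08")]  # ~1,000 briefs
tune = briefs[briefs["month"] == pd.Period("2015-11")]   # ~500 briefs
test = briefs[briefs["month"] == pd.Period("2015-12")]   # ~500 briefs
```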
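The pairwise interpretation of AUC given above can be checked directly on toy data; the scores below are purely synthetic and only serve to show that the two computations agree:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores: positives tend to score a little higher than negatives.
y_true = np.array([1] * 50 + [0] * 50)
y_score = np.concatenate([rng.normal(1.0, 1.0, 50), rng.normal(0.0, 1.0, 50)])

# AUC as computed from the ROC curve.
auc = roc_auc_score(y_true, y_score)

# The pairwise view: the fraction of (positive, negative) pairs in which the
# positive outscores the negative (ties counted as one half).
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
pairwise = ((pos[:, None] > neg[None, :]).mean()
            + 0.5 * (pos[:, None] == neg[None, :]).mean())

print(f"AUC = {auc:.4f}, pairwise probability = {pairwise:.4f}")  # the two agree
```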
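Below is a rough sketch of the train/tune/test workflow and hyperparameter comparison described in this section, using scikit-learn as a stand-in. The synthetic features, the exact grids, and parameters such as the number of trees are assumptions; in practice the three splits would be the feature matrices for the August, November and December briefs:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for the monthly feature matrices and labels.
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_train, y_train = X[:1000], y[:1000]    # August (training)
X_tune, y_tune = X[1000:1500], y[1000:1500]  # November (tuning)
X_test, y_test = X[1500:], y[1500:]      # December (test)

candidates = []

# logReg grid: penalty norm and regularisation strength (scikit-learn's C is
# the inverse of the regularisation parameter).
for penalty, C in product(["l1", "l2", "elasticnet"], [0.01, 0.1, 1.0, 10.0]):
    extra = {"l1_ratio": 0.5} if penalty == "elasticnet" else {}
    candidates.append(
        LogisticRegression(penalty=penalty, C=C, solver="saga", max_iter=5000, **extra)
    )

# rF grid: maximum tree depth.
for depth in [4, 8, 16, None]:
    candidates.append(
        RandomForestClassifier(max_depth=depth, n_estimators=200, random_state=0)
    )


def tune_auc(model):
    """Fit on the training month and score by AUC on the tuning month."""
    model.fit(X_train, y_train)
    return roc_auc_score(y_tune, model.predict_proba(X_tune)[:, 1])


# Select the strongest configuration on the tuning set; only that model is
# ever evaluated on the held-out test month.
best = max(candidates, key=tune_auc)
test_auc = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
print(f"test AUC: {test_auc:.3f}")
```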