## Baseline

We include a baseline model for comparison.

* Input: brief metadata and linked articles (Bing – don't laugh).

### Binary Classification to Timeline

### Dataset

We use a dataset provided to us by Hugo, an AI personal-assistant company based in Surry Hills. The dataset comprises a series of 'briefs'. A brief consists of a named entity and a set of entity-linked articles. The in-house team chooses a small subset of these articles which best summarise the entity. We use this information to construct an article-level tag indicating whether the article was 'brief-worthy' or not.

We take one thousand briefs from August 2015, five hundred from November 2015 and five hundred from December 2015. These constitute our training, tuning and test sets respectively. We constructed a family of classifiers on our training set with different hyper-parameter configurations, compared them by their performance on the tuning set, and evaluated the strongest model on the test set. This is the score presented in all results below.

It was also important to account for our dataset's inherent temporality in our train/tune/test splits. We did not opt for a naive split where we simply sample from a pool of briefs, ignoring the temporal aspect. The first reason for this is obvious: when we make predictions, we want to condition on the past and predict the present or future. If we randomly shuffled a collection of articles, there would be cases where this assumption is broken. Respecting temporal order matters both for the scientific rigour of our evaluation and because it more closely resembles real usage patterns.

Another, more subtle factor is the strong intra-document dependencies: in a given time period, a number of articles appeared consistently in many briefs. We found that when we built a model on month \(x\) and evaluated it on month \(x+1\), we scored significantly higher than when evaluating on, say, month \(x+4\). This suggests that our one-month classifier was keying on the particulars of specific articles rather than an article's 'brief-worthiness' in the abstract. This is a concern if we wish to use a larger range of historical data in the construction of our model. From an implementation perspective, it is also beneficial to be able to update our classifier less frequently. As such, we include the \(x+4\) evaluation measure.

### Experiments

We use the area under the receiver operating characteristic curve (AUC) as our evaluation metric. The ROC curve is created by plotting the true positive rate against the false positive rate at various threshold settings. The area under this curve can be interpreted as the probability that our model gives a higher score to a randomly chosen positive example than to a randomly chosen negative example.

Our experimental parameters were the set of features used and the type of classifier. We tested a range of feature-subset configurations (indicated below) and compared a logistic-regression (logReg) model against one based on random forests (rF). The hyperparameters of the logReg model were the penalty norm (\(\ell^1\), \(\ell^2\) or a mixture) and the regularisation parameter. In our rF model we optimised over maximum tree depth.
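To make the temporal split from the Dataset section concrete, here is a minimal sketch in Python. The brief records and their `timestamp` field are illustrative assumptions; the actual schema of the Hugo dataset is not reproduced here.

```python
# A minimal sketch of the temporal train/tune/test split, assuming each brief
# is a record carrying a "timestamp" field (an assumed, not actual, schema).
from datetime import date


def temporal_split(briefs):
    """Partition briefs by month: Aug 2015 -> train, Nov 2015 -> tune, Dec 2015 -> test."""
    in_month = lambda b, month: b["timestamp"].year == 2015 and b["timestamp"].month == month
    train = [b for b in briefs if in_month(b, 8)]
    tune = [b for b in briefs if in_month(b, 11)]
    test = [b for b in briefs if in_month(b, 12)]
    return train, tune, test


# Toy usage: three illustrative briefs, one per split.
briefs = [
    {"entity": "Example Corp", "timestamp": date(2015, 8, 3)},
    {"entity": "Example Corp", "timestamp": date(2015, 11, 9)},
    {"entity": "Example Corp", "timestamp": date(2015, 12, 1)},
]
train, tune, test = temporal_split(briefs)
```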
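The pairwise-ranking interpretation of AUC described above can be checked directly on toy data: the area under the ROC curve equals the fraction of positive/negative pairs in which the positive example receives the higher score (with ties counted as one half).

```python
# Illustration of the probabilistic interpretation of AUC on toy scores.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.35, 0.8, 0.65, 0.1])

# Probability that a random positive outscores a random negative (ties count as half).
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
rank_prob = np.mean([1.0 if p > n else 0.5 if p == n else 0.0
                     for p in pos for n in neg])

assert np.isclose(rank_prob, roc_auc_score(y_true, y_score))
```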
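Finally, a sketch of the model-selection loop under the setup above. Synthetic data stands in for the article-level feature matrices and 'brief-worthy' labels, and the hyperparameter grids are illustrative rather than the exact values we searched.

```python
# A sketch of the model-selection loop: fit a family of classifiers on the
# training split, compare them by AUC on the tuning split, and keep the best.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for article-level features and labels.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, y_train = X[:300], y[:300]        # stands in for August 2015
X_tune, y_tune = X[300:450], y[300:450]    # stands in for November 2015
X_test, y_test = X[450:], y[450:]          # stands in for December 2015

candidates = []
# Logistic regression: penalty norm and regularisation strength.
for penalty, l1_ratio in [("l1", None), ("l2", None), ("elasticnet", 0.5)]:
    for C in [0.01, 0.1, 1.0, 10.0]:
        candidates.append(LogisticRegression(
            penalty=penalty, C=C, l1_ratio=l1_ratio,
            solver="saga", max_iter=5000))
# Random forest: maximum tree depth.
for max_depth in [2, 4, 8, 16, None]:
    candidates.append(RandomForestClassifier(
        n_estimators=200, max_depth=max_depth, random_state=0))

best_model, best_auc = None, -np.inf
for model in candidates:
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_tune, model.predict_proba(X_tune)[:, 1])
    if auc > best_auc:
        best_model, best_auc = model, auc

# The strongest model on the tuning split is scored once on the test split.
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
```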