
Xavier Holt edited Gold_Timeline_Generation_This_approach__.md almost 8 years ago

Commit id: 5498c8274c92b598b5c44c3249e50c09b7a7c649


## The Framework

We therefore propose a framework in which we use our numeric measures for exploratory analysis and model comparison. Specifically, we will use the ROUGE family of metrics, as well as perplexity, when first developing our model and during hyperparameter selection. Our evaluation process relies upon the existence of gold-standard timelines. As discussed in the literature review, there is currently no dataset that matches this description. Our final contribution will be to develop such a corpus.

Once we have used automatic methods to determine our ideal model, we will use a crowd-vote to determine overall performance. This will take the form of a binary preferencing problem between a collection of system-generated timelines and one gold-standard timeline. This hybrid approach provides the advantages of both numeric and crowd methods while mitigating their downsides. To date, this is the first such method of evaluation for the TLG problem.

### Gold Timeline Generation

This approach generates a timeline that is in some sense 'correct' or 'best'. Different system-generated timelines are then compared against this standard. The more similar a generated timeline is to the standard, by some measure, the better it is judged to be.

### Entity Selection Process

We propose to select twenty figures who are central to the US presidential election. Current front-runners Bernie Sanders, Donald Trump, Hillary Clinton, John Kasich and Ted Cruz will be included. Additionally, we will select a number of figures with less news coverage, to allow evaluation of models on sparser input. Gold-standard generation will involve presenting crowd workers with a list of news URLs. They will then be asked to indicate whether or not each article should be included in a gold-standard timeline.

### Metric Based Comparisons

Once a gold standard has been found, there are several approaches to evaluating a candidate timeline. In ROUGE, timeline quality is measured by the number of overlapping units (e.g. word n-grams) between a timeline and the given gold standard. Intuitively, the higher the ROUGE score, the more similar the two summaries are.

### Crowd-Vote
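The binary preferencing step described in the framework could be aggregated along the following lines. This is only a minimal sketch under stated assumptions: the function name `preference_rates`, the `(system_name, preferred_system)` record layout, and the system names are hypothetical and are not part of the proposal.

```python
from collections import defaultdict


def preference_rates(votes):
    """Return, per system, the fraction of votes preferring it over the gold standard.

    `votes` is an iterable of (system_name, preferred_system) pairs, where
    `preferred_system` is True when the worker chose the system-generated
    timeline and False when they chose the gold-standard timeline.
    This record layout is an assumption made for illustration.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for system_name, preferred_system in votes:
        totals[system_name] += 1
        if preferred_system:
            wins[system_name] += 1
    # Only systems with at least one vote appear in `totals`,
    # so the division below never divides by zero.
    return {name: wins[name] / totals[name] for name in totals}


# Hypothetical votes for two hypothetical systems.
example_votes = [
    ("system_a", True),
    ("system_a", False),
    ("system_a", False),
    ("system_a", True),
    ("system_b", False),
    ("system_b", False),
    ("system_b", True),
    ("system_b", False),
]

rates = preference_rates(example_votes)
```

Under this reading, a system whose rate approaches 0.5 is one that workers prefer about as often as the gold standard. How many votes per pairing are needed, and how to handle disagreement between workers, are left open here.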