## Evaluation Methodology

Our first major contribution is developing a sound scientific methodology for evaluating timelines.

### State of the Art

Timeline evaluation is still an open problem. This can be seen as a function of the task's difficulty: evaluation is subjective. Different readers may prefer one timeline over another based on writing style, length or the types of events covered. Even timelines with the same atomic events and semantic content can vary: rewording, re-ordering or selecting similar sentences from different sources could all result in a large number of equally valid timelines. Any method of evaluation has to be sufficiently robust to this variation. The methods present in the literature so far largely fall into the following camps.

Several TLG approaches adopt a ROUGE metric for this task \cite{Wang2013,Yan2011,Yan2011a}. ROUGE is a family of metrics used for summarisation evaluation; in our case, the reference for comparison is the gold-standard timelines (the basic n-gram overlap is sketched below). ROUGE is an imperfect automatic process which has been criticised for not providing a good measure of content quality \cite{reiter2009investigation}.

An alternative is preference voting. Chieu et al. use a crowd-vote approach \cite{Chieu:2004id}: they present crowd-workers with both the gold standard and a system-generated timeline, and the workers indicate which of the two they prefer. This approach is more costly than automatic evaluation. On the other hand, a timeline is intended to be human readable, so having a reader evaluate its quality is appealing. Below we cover crowd-based approaches without a gold standard in more detail.
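Before turning to crowd-based evaluation, a minimal sketch of the ROUGE-style overlap referred to above may be helpful. It is illustrative only: the function below computes plain ROUGE-1 precision, recall and F1 between two whitespace-tokenised strings, whereas published evaluations typically use a full ROUGE toolkit with stemming and further variants (ROUGE-2, ROUGE-L), and the example timeline entries are invented for this sketch.

```python
from collections import Counter


def rouge_n(system_text, reference_text, n=1):
    """ROUGE-N precision, recall and F1 between a system timeline
    and a single gold-standard reference, both given as plain text."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    sys_ngrams = ngram_counts(system_text)
    ref_ngrams = ngram_counts(reference_text)

    # Clipped overlap: each n-gram is matched
    # min(count in system, count in reference) times.
    overlap = sum((sys_ngrams & ref_ngrams).values())

    recall = overlap / max(sum(ref_ngrams.values()), 1)
    precision = overlap / max(sum(sys_ngrams.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented single-entry timelines, purely for illustration.
gold = "2004-12-26 a magnitude 9.1 earthquake strikes off the coast of Sumatra"
system = "2004-12-26 a 9.1 magnitude earthquake strikes near Sumatra"
print(rouge_n(system, gold, n=1))
```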

Althoff et al. adopt such a crowd-based approach, presenting timelines for comparison to Mechanical Turk workers \cite{Althoff:2015dg}. This approach is attractive in a few ways. It is applicable to timelines generated from a wide range of domains: a voter could just as easily preference a timeline based on tweets, news stories or biographies. Some domain knowledge might be required depending on the application, but this is an issue with crowd-work generally. Even if an individual does not possess knowledge of the topic, they can still express a preference based on other features such as redundancy, length or clarity. There are nevertheless some issues. Cost is a factor: for every topic we need to employ a crowd of people to perform pair-wise voting. This restricts the number of models we can evaluate, as the number of comparisons grows quadratically (comparing $n$ systems requires $n(n-1)/2$ pair-wise votes per topic). Still, the expense could be reasonable depending on the application. The deeper issue is that, with no gold standard, the best system-generated timeline might still be poor; we have no real guarantee of absolute quality. This also has consequences for how we can apply the voting: if we would like to compare different TLG models, we would first have to implement them ourselves, which introduces a possible bias where alternative models may be implemented poorly.

### Proposed Contribution

We therefore propose a framework in which we use our numeric measures for exploratory analysis and model comparison. Specifically, we will use the ROUGE family of metrics as well as perplexity when first developing our model and for hyperparameter selection.

Our evaluation process relies upon the existence of gold-standard timelines. As discussed in the literature review, there is currently no dataset that matches this description. Our final contribution will be to develop such a corpus.
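As a closing note on the numeric measures above, perplexity is the standard held-out likelihood measure. Assuming a token-level probabilistic model (an assumption about the eventual model form, not something fixed by this proposal), for a held-out text $w_1, \dots, w_N$ it is

$$\mathrm{PP}(w_1, \dots, w_N) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{1:i-1})\right),$$

so lower perplexity on held-out timelines indicates a better-fitting model, which is what makes it convenient alongside ROUGE during early model development and hyperparameter selection.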