Xavier Holt edited Timeline_Evaluation_Methodology_Timeline_evaluation__.md  almost 8 years ago

Commit id: 25c9a855c524c803d60386f561adad5fdd7dece4

### Entity Selection Process

We propose to select twenty figures who are central to the US presidential election. The current front-runners Bernie Sanders, Donald Trump, Hillary Clinton, John Kasich and Ted Cruz will be included. Additionally, we will select a number of figures with less news coverage, allowing us to evaluate models on sparser input. Gold-standard generation will involve presenting crowd workers with a list of news URLs. They will then be asked to indicate whether or not each article should be included in a gold-standard timeline.

### Metric-Based Comparisons

Once a gold standard has been produced, there are several approaches to evaluating a candidate timeline. Several TLG approaches adopt a ROUGE metric for this task \cite{Wang2013,Yan2011,Yan2011a}. ROUGE is a family of metrics used for summarisation evaluation; in our case, the references for comparison are the gold-standard timelines. ROUGE is an imperfect automatic process which has been criticised for not providing a good measure of content quality \cite{reiter2009investigation}. An alternative is preference voting. Chieu et al. use a crowd-vote approach \cite{Chieu:2004id}: they present crowd workers with both a gold-standard and a system-generated timeline, and the workers then indicate which of the two they prefer. This approach is more costly than automatic evaluation. On the other hand, a timeline is intended to be human-readable, so having a reader evaluate its quality is appealing.
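To make the metric concrete, here is a minimal sketch of ROUGE-N recall, i.e. the fraction of reference n-grams recovered by a candidate summary. The example strings and the naive whitespace tokenisation are illustrative assumptions; published evaluations use the full ROUGE toolkit, which adds options such as stemming and stopword removal.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: proportion of reference n-grams that also appear in the candidate."""
    def ngrams(text, n):
        tokens = text.lower().split()  # naive whitespace tokenisation (an assumption)
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# Hypothetical gold-standard entry vs. system-generated entry:
gold = "Sanders wins the New Hampshire primary"
system = "Sanders wins primary in New Hampshire"
print(rouge_n_recall(system, gold))  # 5 of 6 gold unigrams matched, ~0.833
```

In practice each system timeline would be scored against every gold-standard timeline for the same entity, and the scores averaged.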
In the section below we cover crowd-based approaches without a gold standard in more detail.

It is worth noting that both of these evaluation methods are only as good as the gold standard underlying them. ROUGE is a metric of distributional similarity to the gold standard; as such, a poor gold standard can bound the measured performance of any model evaluated against it. In a binary comparative crowd-vote, on the other hand, a poorer gold standard would simply cause the system-generated timeline to be preferenced more often. It is therefore important to argue that a gold standard is appropriate before using it in evaluation.

### Crowd-Vote

As outlined above, in a crowd-vote workers are shown a gold-standard and a system-generated timeline and asked which they prefer \cite{Chieu:2004id}. Althoff et al. adopt this approach by presenting timelines for comparison to Mechanical Turk workers \cite{Althoff:2015dg}. The approach is attractive in a few ways. It is applicable to timelines generated from a wide range of domains: a voter could just as easily preference a timeline based on tweets, news stories or biographies. Some domain knowledge might be required depending on the application, but this is an issue with crowd work generally. Even if an individual does not possess knowledge of the topic, they can still preference based on other features such as redundancy, length or clarity. There are nevertheless some issues. Cost is a factor: for every topic, we need to employ a crowd of workers to perform pairwise voting. This restricts the number of models we can compare, as the number of comparisons grows quadratically. Still, the expense could be reasonable depending on the application. The real issue is that the best system-generated timeline might still be bad; with no gold standard, we have no real guarantee. This also has consequences for how we can apply the voting.
If we would like to compare different TLG models, we would have to first implement them ourselves. This introduces a possible bias: the alternative models may be implemented poorly, unfairly favouring our own system.
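The quadratic growth in comparison cost can be sketched directly. The vote-aggregation helper below is a hypothetical illustration of how pairwise preferences might be tallied, not a procedure taken from the cited work.

```python
def num_pairwise_comparisons(n_models):
    # Every unordered pair of timelines must be voted on, so cost is n(n-1)/2.
    return n_models * (n_models - 1) // 2

def preference_rate(votes):
    """Fraction of crowd votes preferring timeline 'A' in one pairwise contest."""
    return votes.count("A") / len(votes)

print(num_pairwise_comparisons(5))            # 10 crowd tasks for five models
print(preference_rate(["A", "A", "B", "A"]))  # 0.75
```

Adding a sixth model raises the task count from 10 to 15, and each task must itself be replicated across several workers, which is where the expense accumulates.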