Xavier Holt edited Our_first_major_contribution_is__.md  almost 8 years ago

Commit id: d6fafecda039061184c2bec8aed495ad4cccee18


An alternative is preference voting. Chieu et al. use a crowd-vote approach \cite{Chieu:2004id}: they present crowd-workers with both the gold standard and a system-generated timeline, and the workers indicate which of the two they prefer. Althoff et al. adopt this approach by presenting timelines to Mechanical Turk workers for comparison \cite{Althoff:2015dg}. This approach has several strengths. It is applicable to timelines generated from a wide range of domains: a voter could just as easily preference a timeline built from tweets, news stories or biographies. Some domain knowledge might be required depending on the application, but this is an issue with crowd-work generally. Even if an individual lacks knowledge of the topic, they can still preference based on other features such as redundancy, length or clarity.

There are nevertheless some issues. Cost is one factor: for every topic, we need to employ a crowd of people to perform pairwise voting. This restricts the number of models we can evaluate, as the number of comparisons grows quadratically. Still, the expense may be reasonable depending on the application. The more serious issue is that the best system-generated timeline might still be bad; with no gold standard, we have no real guarantee of quality. This also has consequences for how we can apply the voting: if we would like to compare different TLG models, we would first have to implement them ourselves, which introduces a possible bias where alternative models may be implemented poorly.

## Proposed Contribution

We therefore propose a framework in which we use our numeric measures for exploratory analysis and model comparison. Specifically, we will use the ROUGE family of metrics as well as perplexity when first developing our model and when selecting hyperparameters. Our evaluation process relies upon the existence of gold-standard timelines; as discussed in the literature review, no existing dataset matches this description.
Our final contribution will therefore be to develop such a corpus. Once we have used automatic methods to select our best model, we will use a crowd-vote to determine overall performance. This will take the form of a binary preferencing task between a collection of system-generated timelines and one gold-standard timeline. This hybrid approach provides the advantages of both numeric and crowd methods while mitigating their downsides. To the best of our knowledge, this is the first such evaluation method for the TLG problem.
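As a rough illustration of this hybrid pipeline, the sketch below shows how automatic scoring can narrow the field of candidate models before the quadratically-growing pairwise crowd stage. Note that `rouge1_f` is a deliberately simplified unigram version of ROUGE-1 written for illustration only (a real evaluation would use a full ROUGE implementation), and the timeline strings are hypothetical placeholders:

```python
from itertools import combinations

def rouge1_f(reference, candidate):
    """Simplified ROUGE-1 F-score: unigram overlap between whitespace tokens."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    remaining = list(ref_tokens)  # copy so repeated tokens are only matched once
    overlap = 0
    for tok in cand_tokens:
        if tok in remaining:
            remaining.remove(tok)
            overlap += 1
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def n_pairwise_votes(n_timelines):
    """Number of pairwise crowd comparisons among n timelines: n*(n-1)/2."""
    return len(list(combinations(range(n_timelines), 2)))

def select_best_model(gold, candidates):
    """Stage 1: use the automatic metric to pick the single best candidate,
    so the crowd stage only has to compare the winner against the gold standard."""
    return max(candidates, key=lambda c: rouge1_f(gold, c))
```

With, say, ten candidate models, exhaustive pairwise voting would require `n_pairwise_votes(10) == 45` crowd judgements per topic; filtering with the automatic metric first reduces the crowd stage to a single winner-versus-gold comparison.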