Research Methods in Timeline Evaluation

Timeline evaluation is still an open problem, in part because the task itself is difficult. Evaluation is subjective: different readers may prefer one timeline over another based on writing style, length or the types of events covered. Even timelines with the same atomic events and semantic content can vary. Rewording, re-ordering or selecting similar sentences from different sources could all produce a large number of equally valid timelines. Any method of evaluation has to be robust to this variation. The methods present in the literature so far largely fall into the following camps:

  • Evaluation from a gold standard.
  • Dataset annotation.
  • Preference voting.
  • Predictive power.

Gold Standard

This approach relies on a reference timeline that is in some sense 'correct' or 'best'. Different system-generated timelines are then compared against this standard: the closer or more similar a generated timeline is to the standard by some measure, the better it is judged to be.

Gold Standard Generation

One common approach uses an existing gold standard. Yan et al. sources six editorial timelines of key events from a range of sources \cite{Yan2011, Yan2011a}. In a similar manner, Chieu et al. uses a list of major earthquakes curated by Reuters editorial staff \cite{Chieu:2004id}. System timelines of the subjects are generated and evaluated against these. The first issue with this approach is whether the model conditions accurately reflect an actual application setting. In general we would go from a query to a timeline. By instead starting from an existing timeline and working backwards, we decrease the difficulty of the problem. In both of these papers, the dataset for timeline generation was chosen after the gold standards were found. Chieu et al. found a Reuters timeline and so used Reuters articles in their dataset. A more realistic implementation of the model wouldn't have the benefit of pre-selection. Another issue is generalisability. If we can only evaluate timelines with an existing gold standard, we have no guarantee of model performance outside of our query domains. For example, one use-case might be to generate summarised timelines for a Twitter user's stream. It's not apparent that gold standards for this kind of problem exist. A news and Twitter timeline might look very different; any model evaluated on the former might perform poorly on the latter. While it's true that this is a general criticism of supervised learning, it is particularly relevant here because in several of the examples below we can readily generate additional test data as needed.

Another approach uses either experts or crowd-workers to manually generate gold standards. Ahmed et al. worked on the TLG subtask of determining when a topic began and ended \cite{Ahmed:2012vh}. Their dataset was a collection of NIPS conference proceedings. They chose eight topics (such as 'kernel') and manually annotated when these appeared for the first and last times. Crowd-workers have also been used to generate gold standards. Wang used Amazon's Mechanical Turk with sixteen workers for each epoch to manually generate timelines for a given query \cite{Wang2013}. These approaches avoid the issues outlined above: they generate a standard given a query (not vice-versa), and can be used on a wide range of domains. There are a number of caveats, the most obvious being cost. We rely either on the availability of an expert or on use of the crowd. This places an additional overhead on evaluation, but depending on the task this may be acceptable. Perhaps more concerning is the question of how we use crowd-workers. In Ahmed et al.'s approach, the standard was clear: a beginning and end date for a given topic. The task in Wang's case is not so simple. We wish to source a timeline from a crowd. Timeline generation, as we've seen, is a difficult task with many different components and dependencies. Implementing this with a non-expert crowd comes with a host of difficulties in terms of crowd requirements, source restriction and consolidation. We do not believe Wang makes a convincing argument for how these problems are to be solved.
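To make the consolidation difficulty concrete, the following is a minimal sketch of one naive strategy for merging crowd-sourced timelines: keep only the events that some minimum number of workers included. The function, data layout and threshold are our own illustration, not Wang's method; in practice near-duplicate event descriptions would also have to be reconciled before counting votes.

```python
from collections import Counter

def consolidate_crowd_timelines(worker_timelines, min_votes=3):
    """Merge timelines submitted by several crowd-workers into one
    consensus timeline.

    worker_timelines : list of timelines, one per worker, where each
        timeline is a list of (date, event_description) tuples.
    min_votes : minimum number of workers who must include an event
        for it to survive into the consensus timeline.

    Events are treated as identical only if both date and description
    match exactly, which is a strong simplification.
    """
    votes = Counter()
    for timeline in worker_timelines:
        for entry in set(timeline):          # one vote per worker per event
            votes[entry] += 1

    consensus = [entry for entry, n in votes.items() if n >= min_votes]
    return sorted(consensus)                 # chronological order (ISO dates)

# Hypothetical usage: three workers, keep events chosen by at least two.
workers = [
    [("2004-12-26", "Indian Ocean earthquake"), ("2005-01-01", "Aid pledged")],
    [("2004-12-26", "Indian Ocean earthquake")],
    [("2004-12-26", "Indian Ocean earthquake"), ("2005-01-01", "Aid pledged")],
]
print(consolidate_crowd_timelines(workers, min_votes=2))
```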

Gold Standard Evaluation

Once a gold standard has been found, there are several approaches used to evaluate a candidate timeline. Several TLG approaches adopt a ROUGE metric for this task \cite{Wang2013,Yan2011,Yan2011a}. ROUGE is a family of metrics used for summarisation evaluation; in our case, the references for comparison are the gold standard timelines. ROUGE is an imperfect automatic process which has been criticised for not providing a good measure of content quality \cite{reiter2009investigation}. An alternative is preference voting. Chieu et al. uses a crowd-vote approach \cite{Chieu:2004id}. They present crowd-workers with both the gold standard and a system-generated timeline, and the workers indicate which of the two they prefer. This approach is more costly than automatic evaluation. On the other hand, a timeline is intended to be human readable, so having a reader evaluate its quality is appealing. In the section below we cover crowd-based approaches without a gold standard in more detail.
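As an illustration of the automatic route, the sketch below scores a system timeline against one or more gold standards with the third-party rouge_score package (pip install rouge-score). This is a modern re-implementation rather than the toolkit used in the cited work, and flattening each timeline into a single string of its entries in date order is our own simplifying assumption.

```python
from rouge_score import rouge_scorer

def rouge_against_gold(system_timeline, gold_timelines):
    """Score a system timeline against one or more gold standards.

    Each timeline is passed as a single string: its entries concatenated
    in date order.  Returns the best F1 over the gold standards for
    ROUGE-1 and ROUGE-2.
    """
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
    best = {"rouge1": 0.0, "rouge2": 0.0}
    for gold in gold_timelines:
        scores = scorer.score(target=gold, prediction=system_timeline)
        for metric in best:
            best[metric] = max(best[metric], scores[metric].fmeasure)
    return best
```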

It is worth noting that both of these evaluation methods are only as good as the gold standard which underpins them. ROUGE measures n-gram overlap with the gold standard; as such, a poor gold standard can bound the performance of any model evaluated against it. In a binary comparative crowd-vote, on the other hand, a poorer gold standard would simply result in the system-generated timeline being preferred more often. It is therefore important to argue that a gold standard is appropriate before using it in evaluation.

Dataset Annotation

Another method of evaluation uses an annotated dataset.

Yahoo! News is used as a dataset in a number of TLG implementations \cite{Ahmed2011, Hong:2011du, Yan2011, Yan2011a}. The Yahoo! corpus includes pairwise annotations indicating if two articles are 'must-link' or 'cannot-link'. This corresponds to whether or not a pair of articles belong to the same story. TDT-2 is another dataset used in several TLG models \cite{Swan:2000dy,Allan:2001bx}. TDT-2 is a corpus of 60,000 articles. It contains two hundred topics, and each article is annotated as either on- or off-topic for each of them.

In all cases where these datasets have been employed, evaluation is simple: we compare the clustering performance of the system timeline against these annotations. In the Yahoo! case we examine how many 'must-link' pairs we cluster together. Similarly, when TDT-2 is used, clustering is evaluated on the basis of how closely our clusters match the given topics.
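A minimal sketch of this pairwise check, assuming the annotations are available as lists of article-id pairs, might look as follows; the function name and data layout are our own illustration rather than any published evaluation script.

```python
def pairwise_link_accuracy(cluster_of, must_link, cannot_link):
    """Evaluate a system clustering against pairwise annotations.

    cluster_of  : dict mapping article id -> cluster id assigned by the system
    must_link   : list of (a, b) pairs annotated as the same story
    cannot_link : list of (a, b) pairs annotated as different stories

    Returns the fraction of must-link pairs placed in the same cluster
    and the fraction of cannot-link pairs kept apart.
    """
    same = sum(cluster_of[a] == cluster_of[b] for a, b in must_link)
    apart = sum(cluster_of[a] != cluster_of[b] for a, b in cannot_link)
    return same / len(must_link), apart / len(cannot_link)

# Hypothetical usage with a toy clustering of three articles.
clusters = {"a1": 0, "a2": 0, "a3": 1}
print(pairwise_link_accuracy(clusters,
                             must_link=[("a1", "a2")],
                             cannot_link=[("a1", "a3"), ("a2", "a3")]))
```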

These annotated datasets are useful for several reasons. No gold standard has to be generated, and the evaluation process is straightforward. However, there are several critical issues which restrict their use. The annotations are only useful for evaluating the clustering component of our models. This is a critical sub-task in TLG, but distinct from selection or summarisation. As such these annotations can only be used to evaluate a small part of the problem. Annotation for the selection and summarisation tasks is in theory possible but would be a massive undertaking.

The articles in TDT-2 were originally published in 1998, and the way news is written and published has changed dramatically since then. Furthermore, both datasets are comprised of news articles. In contrast, a Twitter timeline would be built on a sparser dataset with far smaller documents. If we wished to apply TLG to a medium such as tweets, there is no guarantee this evaluation process would be applicable.

Comparative Voting

The last method for timeline evaluation is a crowd-vote. Several candidate system timelines are generated and evaluated by how many preference votes they receive. This differs from the crowd approach above in that the preference is among system timelines: there is no gold standard.

Althoff et al. adopts this approach by presenting timelines for comparison to Mechanical Turk workers \cite{Althoff:2015dg}. This approach is attractive in several ways. It is applicable to timelines generated from a wide range of domains: a voter could just as easily preference a timeline based on tweets, news stories or biographies. Some domain knowledge might be required depending on the application, but this is an issue with crowd-work generally. Even if an individual doesn't possess knowledge of the topic, they can still preference based on other features such as redundancy, length or clarity. There are nevertheless some issues which arise. Cost is a factor. For every topic, we need to employ a crowd of people to perform pair-wise voting. This restricts the number of models we can compare, as the number of comparisons grows quadratically in the number of systems. Still, the expense could be reasonable depending on the application. The real issue is that the best system-generated timeline might still be bad. With no gold standard, we have no real guarantee. This also has consequences for how we can apply the voting. If we would like to compare different TLG models then we would have to first implement them ourselves. This introduces a possible bias where alternative models may be implemented poorly.
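To illustrate how the comparisons scale, the sketch below tallies pairwise crowd preferences over a set of system timelines. The prefer callable stands in for a batch of Mechanical Turk judgements and is purely hypothetical; with n systems it is called n(n-1)/2 times, which is the quadratic growth noted above.

```python
from collections import Counter
from itertools import combinations

def pairwise_vote(timelines, prefer):
    """Rank system timelines by pairwise crowd preference.

    timelines : dict mapping system name -> generated timeline text
    prefer    : callable taking two timeline texts and returning True if
                the crowd prefers the first (a stand-in for a batch of
                crowd judgements)

    Returns (system, wins) pairs sorted by number of pairwise wins;
    systems with no wins are omitted by Counter.most_common().
    """
    wins = Counter()
    for a, b in combinations(sorted(timelines), 2):
        winner = a if prefer(timelines[a], timelines[b]) else b
        wins[winner] += 1
    return wins.most_common()
```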