Authorea

Xavier Holt edited Introduction_and_contribution_1_page__.md almost 8 years ago

Commit id: 7f984b0f734314ed5c3881367d0afc2e9de80ebc


We first introduce the problem and survey several domains where it has been studied. We then address the two primary research questions that arise when comparing different methods: how best to model timelines and how best to evaluate them. In each of these two areas we organise the key papers thematically. The contributions of each paper are synthesised into a general discussion of the pros and cons of each approach.

### The Problem

Timeline generation (TLG) is a way of representing a large amount of temporally dependent information concisely. It is query driven: we retrieve a corpus of text linked to some entity, event or other term. The canonical TLG model clusters these articles into topics or stories, selects the most important of these clusters and returns timestamped summaries. It can be seen as a generalisation of the multi-document summarisation task in which we have introduced temporal dependency and structure.

**Clustering**: Our model handles both current and historical articles. Several domains where the TLG model is desirable (e.g. news) are characterised by a large historical catalogue and a high frequency of new articles. Our clustering model has to be scalable and, ideally, functional in a streaming context.

**Topic models**: Incorporating a topic structure into our model gives us an intuitive and understandable document representation. Looking at the topic distribution also lets us prioritise certain kinds of sentences or stories.

**Timelines**: TLG differs from regular multi-document summarisation in its temporal dependence. As such, we value diversity both in the kind of content selected and in when it was published.

**Summarisation**: While clustering and topic modelling are useful structuring tools, we must still compress the data for human readability.

### Applications

Timeline generation has been applied in a range of domains. Several approaches focus on specific sub-problems in TLG. Althoff et al. generate timelines for entities in the Freebase repository \cite{Althoff:2015dg}.
This knowledge base comprises subjects and objects linked by predicates; 'Robert Downey Junior' might be linked by 'Starred In' to 'The Avengers'. The TLG model seeks to group and select a subset of these subject-predicate links that summarises our query entity. Working from a knowledge base instead of a corpus of text provides useful structure. We are spared the tasks of preprocessing, entity linking, consolidation and data validation on our text corpora. Of course, this relies on the existence of a knowledge base of timestamped facts about our query entity, which is an issue if we want to be able to perform queries on lesser-known entities. A similar approach was used by Ahmed et al. on a corpus of academic papers \cite{Ahmed:2012vh}.
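
To make the knowledge-base setting concrete, the sketch below represents timestamped subject-predicate-object facts and selects a small, chronologically ordered subset for a query entity. It is only an illustration under stated assumptions: the `Fact` schema, the `naive_timeline` function, the sample data and the "prefer distinct predicates" selection rule are all hypothetical, and none of them is the method of Althoff et al. or the Freebase API.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    """A timestamped subject-predicate-object triple (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str
    when: date


def naive_timeline(entity: str, facts: list[Fact], k: int) -> list[Fact]:
    """Pick up to k facts about `entity`, favouring distinct predicates.

    A toy selection rule chosen for illustration only; it stands in for
    the grouping and selection step described in the text.
    """
    about = sorted((f for f in facts if f.subject == entity),
                   key=lambda f: f.when)
    chosen: list[Fact] = []
    seen_predicates: set[str] = set()
    # First pass: earliest fact for each predicate not yet represented.
    for f in about:
        if len(chosen) < k and f.predicate not in seen_predicates:
            chosen.append(f)
            seen_predicates.add(f.predicate)
    # Second pass: fill any remaining slots with the leftover facts.
    for f in about:
        if len(chosen) < k and f not in chosen:
            chosen.append(f)
    # Return the selection in chronological order.
    return sorted(chosen, key=lambda f: f.when)


# Toy data; the dates are illustrative and not drawn from any knowledge base.
RDJ = "Robert Downey Junior"
FACTS = [
    Fact(RDJ, "Starred In", "Iron Man", date(2008, 5, 2)),
    Fact(RDJ, "Starred In", "The Avengers", date(2012, 5, 4)),
    Fact(RDJ, "Born In", "New York City", date(1965, 4, 4)),
    Fact("Chris Evans", "Starred In", "The Avengers", date(2012, 5, 4)),
]

for fact in naive_timeline(RDJ, FACTS, k=2):
    print(fact.when.isoformat(), fact.predicate, fact.obj)
```

With `k=2` the toy rule keeps one fact per predicate before repeating any, which mirrors the preference for diverse content noted above. A real system would replace this rule with a learned or optimised selection criterion.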