ROUGH DRAFT authorea.com/109159

# Introduction

We first introduce the problem and examine several domains where it has been examined. We then address the two primary research questions that arise when comparing different methods: how best to model and evaluate timelines. In each of these two areas we organise the key papers thematically. The contributions of each paper are synthesised into a general discussion about the pros and cons of each approach.

## The Problem

Timeline generation (TLG) is a way of representing a large amount of temporally dependent information concisely. It is query-driven: we retrieve a corpus of text linked to some entity, event or other term. The canonical TLG model clusters these articles into topics or stories, selects the most important of these clusters and returns timestamped summaries. It can be seen as a generalisation of the multi-document summarisation task, where we have introduced temporal dependency and structure.

Clustering: Our model handles both current and historical articles. Several domains where the TLG model is desirable (e.g. news) are characterised by a large historical catalog and a high publication frequency. Our clustering model therefore has to be scalable and ideally functional in a streaming context.

Topic models: Incorporating a topic structure into our model gives us an intuitive and understandable document representation. Looking at the topic distribution also lets us prioritise certain kinds of sentences or stories.

Timelines: TLG differs from regular multi-document summarisation in its temporal dependence. As such, we value diversity in both the kind of content selected as well as when it was published.

Summarisation: While clustering and topic modeling are useful structuring tools, we must still compress the data for human readability.
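The four components above compose into the canonical cluster-select-summarise pipeline. The sketch below shows that composition end to end; the clustering-by-date rule and first-sentence "summaries" are deliberately naive stand-ins for illustration, not the method of any cited system.

```python
from collections import defaultdict

def generate_timeline(articles, max_entries=2):
    """articles: list of (date, text) pairs; returns a sorted [(date, summary), ...]."""
    # 1. Cluster: here, naively, by publication date.
    clusters = defaultdict(list)
    for date, text in articles:
        clusters[date].append(text)
    # 2. Select: keep the largest clusters as a crude stand-in for "importance".
    biggest = sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True)
    selected = biggest[:max_entries]
    # 3. Summarise: here, just the first sentence of the cluster's first article.
    return sorted((date, texts[0].split(". ")[0]) for date, texts in selected)

articles = [
    ("2012-05-04", "The Avengers opens worldwide. Critics respond warmly."),
    ("2012-05-04", "Box office records fall as The Avengers debuts."),
    ("2012-10-01", "A sequel is announced."),
]
print(generate_timeline(articles))
```

Each stage is where the real design decisions reviewed below live: the clustering step must scale, the selection step must balance importance and diversity, and the summarisation step must compress.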

## Applications

Timeline generation has been applied in a range of domains. Several approaches focus on specific sub-problems in TLG. Althoff et al. generate timelines for entities in the Freebase repository (Althoff 2015). This knowledge base is composed of subjects and objects linked by predicates; 'Robert Downey Junior' might be linked by 'Starred In' to 'The Avengers'. The TLG model seeks to group and select a subset of these subject-predicate pairs that summarises the query entity. Working from a knowledge base instead of a corpus of text provides useful structure: we are spared the preprocessing, entity linking, consolidation and data validation required for text corpora. Of course, this relies on the existence of a knowledge base of timestamped facts about the query entity, which is an issue if we want to be able to query lesser-known entities. A similar approach was used by Ahmed et al. on a corpus of academic papers (Ahmed 2012).

Single-document variations of the TLG model have been applied to Wikipedia and Wikinews articles (Bauer 2015, Minard 2015). The single-document setting shifts the focus from clustering to summarisation and selection. Timelines took the form of timestamped subsets of the document's sentences. A similar approach has been applied to Twitter feeds; Hong et al. applied a single-document version of the TLG model to a user's tweet history (Hong 2011).

The most frequent application of TLG models is to a corpus of news articles (Chieu 2004, Hong 2011, Yan 2011, Yan 2011a, Allan 2001, Swan 2000, Ahmed 2011). A document retrieval service is given a query term: a person, event, company or generic term. The body of documents retrieved forms the input for the TLG task. The task then is to group and summarise a potentially large body of articles and provide a compact representation of the query term.

# Research Methods in Timeline Modelling

Several approaches have been used to model timelines and the TLG problem. They can be roughly divided into optimisation-based approaches and probabilistic approaches. Our focus and the majority of our analysis is on the latter. They are the more prevalent approach and have several highly desirable attributes in terms of post-hoc querying and representation.

## Optimisation Based Approaches

Optimisation-based approaches develop a function that measures sentence importance. This has to account for relevance and diversity -- both in terms of content and temporality. Once this function has been formulated, inference occurs through maximisation.

### Simple Optimisation

Several approaches formulate TLG as a simple optimisation problem. Althoff et al. and Chieu et al. both define objective functions at the sentence level which include signals of relevance and diversity (Althoff 2015, Chieu 2004). Numeric methods are used to find the sentences maximising this objective. This approach forgoes the topic and clustering structure present in other methodologies. The advantage is that model construction and timeline generation are relatively simple. The disadvantage is that without any additional structure on the data it can be hard to determine a valid and representative objective function. That structure can also be useful in and of itself, as discussed in the subsequent sections.
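As a sketch of this style of objective (assuming nothing about the cited papers' actual features, which are far richer), a greedy maximiser might trade off query relevance against redundancy with already-selected sentences:

```python
def jaccard(a, b):
    """Crude word-overlap similarity between two sentences."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def select_sentences(sentences, query, k=2, diversity_penalty=1.0):
    """Greedily pick k sentences maximising relevance minus redundancy."""
    chosen = []
    while len(chosen) < k:
        def objective(s):
            relevance = jaccard(s, query)
            redundancy = max((jaccard(s, c) for c in chosen), default=0.0)
            return relevance - diversity_penalty * redundancy

        best = max((s for s in sentences if s not in chosen), key=objective)
        chosen.append(best)
    return chosen

sentences = ["The Avengers movie opened.",
             "The Avengers movie opened today.",
             "Downey discussed his role."]
print(select_sentences(sentences, "Avengers movie", k=2))
```

Note that the second pick is the dissimilar Downey sentence rather than the near-duplicate, showing the diversity term overriding raw relevance.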

### PageRank Framework

In a similar manner, some authors have applied a PageRank-esque framework to TLG (Yan 2011, Yan 2011a, Tran 2015). These models also operate by maximising an objective function, but are characterised by an underlying graph structure. Sentences are linked to one another and ranked by the 'votes' or 'preferences' they receive from their neighbours. Often inter- and intra-document sentence links are represented and weighted differently to improve performance (Wan 2007). Temporal dependence is induced through the particulars of graph construction; Yan et al., for example, model this component through intra- and inter-date edges (Yan 2011, Yan 2011a). The PageRank approach is more structured than a simple optimisation approach, at the cost of additional model-building complexity. In exchange, the graph structure is argued to be an intuitive representation of the problem (Wan 2007).
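A minimal version of such a ranker (LexRank-style; the similarity measure and damping factor are illustrative defaults, not those of the cited papers) runs the PageRank iteration over a sentence-similarity graph:

```python
def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def pagerank(weights, damping=0.85, iters=50):
    """Power iteration over a weighted graph; weights[j][i] is j's edge to i."""
    n = len(weights)
    out_sums = [sum(row) or 1.0 for row in weights]  # each node's total outgoing weight
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Each sentence distributes its rank among neighbours in proportion to edge weight.
        rank = [(1 - damping) / n
                + damping * sum(weights[j][i] / out_sums[j] * rank[j] for j in range(n))
                for i in range(n)]
    return rank

sentences = ["the avengers opened today",
             "the avengers opened worldwide today",
             "downey stars in the avengers"]
n = len(sentences)
w = [[0.0 if i == j else jaccard(sentences[i], sentences[j]) for j in range(n)]
     for i in range(n)]
ranks = pagerank(w)
print(ranks)  # the two mutually similar sentences receive the most 'votes'
```

The temporal variants in the literature keep this iteration but weight edges differently depending on whether the linked sentences share a date or document.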

## Simple Probabilistic Models

The earliest attempts at the TLG problem were simple probabilistic models. Allan et al.'s approach was a first attempt at generalising sentence selection to include temporal dependence (Allan 2001). The model ranked documents by defining, at the sentence level, the probability that a sentence was "useful and novel". As a simplifying assumption, usefulness and novelty were taken to be independent, that is $$P\left(\text{useful} \land \text{novel}\right) = P\left(\text{useful}\right)P\left(\text{novel}\right)$$. The features were modeled through simple counting statistics and sentence similarities -- a 'novel' document was dissimilar to documents appearing earlier. Inference involved emitting all sentences with $$P\left(\text{useful} \land \text{novel}\right) > \theta$$ for some parameter $$\theta$$.
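A toy version of this decision rule looks as follows; the probability estimates are crude word-overlap proxies invented for illustration, not the paper's actual counting statistics:

```python
def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def p_useful(sentence, query):
    """Proxy for P(useful): similarity to the query."""
    return jaccard(sentence, query)

def p_novel(sentence, emitted):
    """Proxy for P(novel): dissimilarity to everything emitted earlier."""
    return 1.0 - max((jaccard(sentence, e) for e in emitted), default=0.0)

def emit(sentences, query, theta=0.1):
    """Emit sentences (arriving in time order) with P(useful) * P(novel) > theta."""
    emitted = []
    for s in sentences:
        if p_useful(s, query) * p_novel(s, emitted) > theta:
            emitted.append(s)
    return emitted

stream = ["the avengers movie opened",
          "the avengers movie opened friday",  # near-duplicate: low novelty
          "downey avengers interview"]
print(emit(stream, "avengers movie"))
```

The near-duplicate is suppressed by the novelty factor even though it is highly relevant, which is exactly the temporal-diversity behaviour the model was built for.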

The advantage of this model is that we immediately retrieve some measure of uncertainty. Optimisation-based approaches could emit uninformative sentences simply because they maximise the cost function. A well-implemented probability-based approach indicates not just the most desirable sentence but also an interpretable measure of how relevant it is.

## Bayesian Models

Several authors model TLG as a Bayesian probabilistic network (Ahmed 2011, Hong 2011, Wang 2013, Ahmed 2012, Li 2014). In all cases, a hierarchical topic model is placed over the documents. There have been both parametric and non-parametric approaches. The models also differ in how they handle temporality and clustering.

### Topic Models

Topics are represented as distributions over the words in the corpus. A document in turn is represented as a mixture of these topics. In the hierarchical model, topics are arranged in a tree structure: the closer two topics are in the tree, the more similar their word distributions. Topics closer to the root are broad, and their descendants are sub-topics. For example, the story 'Robert Downey Junior stars in the Avengers' might be modeled as a mixture of 50% topic 'Robert Downey Junior' and 50% topic 'Avengers'. In the topic tree, perhaps 'Robert Downey Junior' is a child of 'Actors', and we can trace a path up the tree from 'The Avengers' to 'Comic Book Movies' to 'Movies'.
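The flat part of this representation can be made concrete with a tiny generative sketch; the topics, words and weights below are invented for the running example, and the hierarchy is omitted:

```python
import random

# Each topic is a distribution over words; a document is a mixture of topics.
topics = {
    "Robert Downey Junior": {"downey": 0.5, "actor": 0.3, "role": 0.2},
    "Avengers": {"avengers": 0.6, "marvel": 0.25, "movie": 0.15},
}
doc_mixture = {"Robert Downey Junior": 0.5, "Avengers": 0.5}  # the story above

def sample_word(rng):
    # Draw a topic from the document's mixture, then a word from that topic.
    topic = rng.choices(list(doc_mixture), weights=list(doc_mixture.values()))[0]
    words = topics[topic]
    return rng.choices(list(words), weights=list(words.values()))[0]

rng = random.Random(0)
print([sample_word(rng) for _ in range(5)])
```

Inference runs this generative story in reverse: given only the words, recover the topic distributions and each document's mixture.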

### Parametric and Non-parametric Approaches

A non-parametric Bayesian model is characterised by the property that one or more of its parameters are determined by the data; a parametric model is defined by the absence of this property. For example, in a parametric $$k$$-means clustering algorithm, $$k$$, the number of clusters, would be manually specified. A non-parametric $$k$$-means, on the other hand, would infer the number of clusters from the data itself. The hierarchical topic models implementing TLG can be classified by two levels of parametricity. Ahmed et al. and Hong et al. used a hierarchical LDA model (Ahmed 2011, Hong 2011); both the number of topics and the depth of the topic tree are parameters that have to be manually specified. On the other hand, Ahmed et al., Li et al. and Wang all use models with a nonparametric number of topics (Wang 2013, Ahmed 2012, Li 2014). Ahmed et al.'s and Wang's models also have a nonparametric tree depth; Li et al.'s does not.
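The way a nonparametric model lets the cluster count grow with the data can be illustrated with the Chinese restaurant process that underlies many such models; the concentration parameter below is an arbitrary illustrative value:

```python
import random

def crp_cluster_sizes(n, alpha=1.0, seed=0):
    """Assign n observations to clusters via the Chinese restaurant process."""
    rng = random.Random(seed)
    sizes = []  # current cluster sizes ("customers per table")
    for i in range(n):
        # Join existing cluster k with probability sizes[k] / (i + alpha),
        # or open a new cluster with probability alpha / (i + alpha).
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
    return sizes

sizes = crp_cluster_sizes(200)
print(len(sizes))  # the number of clusters was inferred, never specified
```

The number of clusters grows (roughly logarithmically) with the number of observations, which is exactly the behaviour we want when the number of stories grows with the corpus.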

The choice of parametrisation is a tradeoff. Nonparametric models don't require specifying data-dependent parameters and therefore need less dataset-dependent tuning, which makes them more general-purpose. The number of stories present in our corpus should also increase as a function of time, which makes the use of a parametric model a little awkward. We could base the number of topics on the end state of the model, but this would mean extra overhead in the early epochs of training, when the true number of topics is small. Furthermore, this assumes we have a 'final epoch' at all: in the streaming version of the TLG problem, it is not apparent what fixed number would constitute a reasonable amount of topics. The additional flexibility of nonparametric models comes at the expense of more work in the inference phase; when we replace fixed parameters with densities we increase our model dimensionality. This is mitigated on a few counts. In the parametric case our parameters still have to be determined somehow, generally through cross-validation or a parameter optimisation process. This overhead could easily outweigh the extra time at inference in the non-parametric case, especially as in most cases we only replace a handful of parameters.

### Inference in Bayesian TLG Models

The result of the models above is a probability density. The catch is that, unlike with simpler models, the integral representing our density is intractable. One solution makes use of the fact that while integrating over the whole measure space is difficult, plugging in values and retrieving an (un-normalised) probability is not. We can exploit this property by performing an intelligent random walk over the surface of the density. The idea is that if we walk for long enough, we obtain a reasonable representation of the surface. This is the basis for a family of inference methods called Markov-Chain Monte-Carlo (MCMC) sampling.
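A minimal Metropolis sampler makes the idea concrete: only pointwise evaluations of an unnormalised density are needed. The target here (a standard normal, unnormalised), the step size and the chain length are all illustrative:

```python
import math
import random

def unnormalised_density(x):
    return math.exp(-0.5 * x * x)  # standard normal, up to its normalising constant

def metropolis(n_samples, step=1.0, seed=0):
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_samples):
        proposal = x + rng.uniform(-step, step)  # the "random walk" step
        # Accept with probability min(1, p(proposal)/p(x)); the normaliser cancels.
        if rng.random() < unnormalised_density(proposal) / unnormalised_density(x):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis(20000)
print(sum(samples) / len(samples))  # close to 0, the target's mean
```

Run long enough, the chain's empirical distribution approximates the target; the scaling problems discussed below stem from how long "long enough" becomes in high dimensions.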

All current Bayesian TLG formulations use MCMC-based inference methods (Ahmed 2011, Hong 2011, Wang 2013, Ahmed 2012, Li 2014). They are simple to implement and an intuitive method for exploring an unknown density. On the other hand, they tend to scale poorly with both dataset size and dimensionality (Wainwright 2008, Grimmer 2010). NLP in general involves large amounts of sparse, high-dimensional data. Furthermore, TLG is a summarisation task, and the value of summarisation grows with the size of the underlying data. Because of this, exploring additional inference methods is an important goal for further research.

One alternative to sampling-based methods is variational inference. In broad strokes, this method defines a family of densities capable of approximating the generating distribution of any given dataset. We then perform an iterative optimisation process to find the distribution in this family that best matches our data. Variational inference has been shown to generate models of similar quality to MCMC methods several orders of magnitude faster (Wainwright 2008, Grimmer 2010).
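To make the contrast concrete, here is a deliberately naive sketch of the variational idea: choose, from a simple family (here Gaussians q(x; mu, sigma)), the member maximising a Monte-Carlo estimate of the ELBO against an unnormalised target. Real variational inference derives closed-form coordinate updates rather than grid-searching, so this only illustrates the objective being optimised:

```python
import math
import random

def log_target(x):
    return -0.5 * x * x  # unnormalised log density: a standard normal

def elbo(mu, sigma, rng, n=2000):
    """Monte-Carlo ELBO = E_q[log p(x)] - E_q[log q(x)], sampling x ~ q."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        log_q = (-0.5 * ((x - mu) / sigma) ** 2
                 - math.log(sigma * math.sqrt(2 * math.pi)))
        total += log_target(x) - log_q
    return total / n

rng = random.Random(0)
grid = [(mu, sigma) for mu in (-1, -0.5, 0, 0.5, 1) for sigma in (0.5, 1.0, 1.5)]
best = max(grid, key=lambda p: elbo(p[0], p[1], rng))
print(best)  # (0, 1.0): the family member matching the target exactly
```

The optimisation replaces the random walk entirely: instead of exploring the density, we fit a tractable approximation to it, which is where the speed advantage comes from.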

No implementations of TLG to date use this form of inference. Variational methods are highly specific to the underlying model, and developing a new variational inference formulation is an involved task. Nevertheless they could provide a large increase in scalability. For the sake of comparison, we can observe the work done by Wang et al. and Bryant et al. (Blei 20