Simple Probabilistic Models

The earliest attempts at approaching the TLG problem were simple probabilistic models. Allan et al.'s approach was a first attempt at generalising sentence selection to include temporal dependence \cite{Allan:2001bx}. The model ranked sentences by estimating, at the sentence level, the probability that each was "useful and novel". As a simplifying assumption, usefulness and novelty were taken to be independent, that is \(P\left(\text{useful$\land$novel}\right) = P\left(\text{useful}\right)P\left(\text{novel}\right)\). The features were modeled through simple counting statistics and sentence similarities -- a 'novel' sentence was dissimilar to sentences appearing earlier. Inference involved emitting all sentences with \(P\left(\text{useful$\land$novel}\right) > \theta\) for some parameter \(\theta\).
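To make the emission rule concrete, the sketch below reduces it to a threshold on the product of two per-sentence estimates. It is a minimal illustration rather than Allan et al.'s implementation: the probability values are made up, and how \(P\left(\text{useful}\right)\) and \(P\left(\text{novel}\right)\) are estimated is left outside the function.

\begin{verbatim}
import numpy as np

def emit_sentences(p_useful, p_novel, theta=0.6):
    """Return indices of sentences whose joint score exceeds theta.

    p_useful, p_novel: per-sentence probability estimates (how they are
    estimated is model-specific; here they are simply inputs).  Under the
    independence assumption, P(useful AND novel) is their product.
    """
    joint = np.asarray(p_useful) * np.asarray(p_novel)
    return [i for i, p in enumerate(joint) if p > theta]

# Illustrative values for three sentences.
print(emit_sentences([0.9, 0.4, 0.8], [0.8, 0.9, 0.9], theta=0.6))  # -> [0, 2]
\end{verbatim}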

The advantage of this model is that we immediately retrieve some measure of uncertainty. Optimisation-based approaches can emit uninformative sentences simply because they maximise the cost function. A well-implemented probability-based approach indicates not just the most desirable sentence but also gives an interpretable measure of its relevance.

Bayesian Models

Several authors model TLG as a Bayesian probability network \cite{Ahmed2011, Hong:2011du, Wang2013, Ahmed:2012vh, Li2013}. In all cases, a hierarchical topic model is placed over the documents. There have been both parametric and non-parametric approaches. The models also differ in how they handle temporality and clustering.

Topic Models

Topics are represented as distributions over the words in the corpus. A document in turn is represented as a mixture of these topics. In the hierarchical model, topics are arranged in a tree structure: the closer two topics are in the tree, the more similar their word distributions. Topics closer to the root are broad, and descendants in the tree are sub-topics. For example, the story 'Robert Downey Junior stars in the Avengers' might be modeled as a mixture of 50% topic 'Robert Downey Junior' and 50% topic 'Avengers'. In the topic tree, 'Robert Downey Junior' might be a child of 'Actors', and we could trace a path up the tree from 'The Avengers' to 'Comic Book Movies' to 'Movies'.
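The sketch below encodes this example with hypothetical topic proportions and word probabilities (the numbers and the tiny vocabulary are invented for illustration), and recovers the document's implied word distribution as the mixture-weighted sum of its topics' distributions.

\begin{verbatim}
# Illustrative topics: each is a distribution over a tiny vocabulary.
topics = {
    "Robert Downey Junior": {"downey": 0.5, "actor": 0.3, "avengers": 0.2},
    "Avengers":             {"avengers": 0.6, "movie": 0.3, "downey": 0.1},
}

# The example story as a 50/50 mixture of the two topics.
doc_mixture = {"Robert Downey Junior": 0.5, "Avengers": 0.5}

# The document's word distribution is the mixture-weighted sum.
word_dist = {}
for topic, weight in doc_mixture.items():
    for word, prob in topics[topic].items():
        word_dist[word] = word_dist.get(word, 0.0) + weight * prob

print(word_dist)
# {'downey': 0.3, 'actor': 0.15, 'avengers': 0.4, 'movie': 0.15}
\end{verbatim}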

Parametric and Non-parametric Approaches

A non-parametric Bayesian model is characterised by the property that one or more of its parameters are determined by the data. Contrastingly, a parametric model is defined by the absence of this property. For example, in a parametric \(k\)-means clustering algorithm, \(k\), the number of clusters, would be manually specified. A non-parametric \(k\)-means, on the other hand, would infer the number of clusters from the data itself. The hierarchical topic models implementing TLG can be classified at two levels of parametrisation. Ahmed et al. and Hong et al. used a hierarchical LDA model \cite{Ahmed2011, Hong:2011du}: both the number of topics and the depth of the topic tree are parameters that must be manually specified. In contrast, Ahmed et al., Li et al. and Wang all use models with a nonparametric number of topics \cite{Wang2013, Ahmed:2012vh, Li2013}. Ahmed et al.'s and Wang's models also have a nonparametric tree depth; Li et al.'s does not.
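The distinction can be illustrated with an off-the-shelf clustering library. The sketch below, using scikit-learn, contrasts \(k\)-means, where \(k\) is fixed in advance, with a truncated Dirichlet-process Gaussian mixture, which is given only a generous upper bound on the number of components and lets the data decide how many carry appreciable weight. The synthetic data and the 0.01 weight threshold are illustrative choices, and this is not the model used in any of the cited works.

\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: three well-separated Gaussian blobs in 2-D.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 4, 8)])

# Parametric: the number of clusters k must be chosen up front.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Nonparametric in spirit: a truncated Dirichlet-process mixture with a
# generous upper bound; the posterior concentrates weight on the components
# the data supports and leaves the rest with weight near zero.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

print("k-means clusters:", kmeans.n_clusters)
print("DP mixture active components:", int(np.sum(dpgmm.weights_ > 0.01)))
\end{verbatim}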

The choice of parametrisation is a tradeoff. Nonparametric models do not require manually specifying data-dependent parameters, and therefore need less dataset-dependent tuning; this makes them more general-purpose. The number of stories present in our corpus should also increase as a function of time, which makes the use of a parametric model a little awkward. We could base the number of topics on the end-state of the model, but this would incur extra overhead in the early epochs of training, when the true number of topics is small. Furthermore, this assumes we have a 'final epoch' at all: in the streaming version of the TLG problem, it is not apparent what fixed number would constitute a reasonable amount of topics. The additional flexibility of nonparametric models comes at the expense of more work in the inference phase, since replacing fixed parameters with densities increases the model dimensionality. This is mitigated on a few counts. In the parametric case our parameters still have to be determined somehow, generally through cross validation or a parameter optimisation process. This overhead can easily outweigh the extra inference time in the non-parametric case, especially as in most cases we only replace a handful of parameters.

Inference in Bayesian TLG Models

The result of the models above is a probability density. The catch is that, unlike with simpler models, the integral needed to normalise this density is intractable. One solution makes use of the fact that while integrating over the whole measure space is difficult, plugging in values and retrieving an (un-normalised) probability is not. We can exploit this property by performing an intelligent random walk over the surface of the density: if we walk for long enough, we obtain a reasonable representation of the surface. This is the basis for a family of inference methods called Markov-Chain Monte-Carlo (MCMC) sampling.
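A minimal example of this idea is the random-walk Metropolis algorithm, one of the simplest MCMC methods. The sketch below is not taken from any of the cited systems; it only requires pointwise evaluations of an un-normalised (log) density, which is exactly the property described above, and uses a toy one-dimensional target.

\begin{verbatim}
import numpy as np

def metropolis_hastings(log_unnorm_density, x0, n_steps=10_000, step=0.5, seed=0):
    """Random-walk Metropolis sampler over a scalar variable.

    Only the *un-normalised* log density is needed: at each step we propose
    a nearby point and accept it with probability min(1, p(new) / p(current));
    over many steps the visited points approximate the normalised density.
    """
    rng = np.random.default_rng(seed)
    x, log_p = x0, log_unnorm_density(x0)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.normal(scale=step)
        log_p_new = log_unnorm_density(proposal)
        if np.log(rng.uniform()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        samples.append(x)
    return np.array(samples)

# Example: an un-normalised Gaussian centred at 3; we never compute its
# normalising constant, yet the post-burn-in sample mean is close to 3.
samples = metropolis_hastings(lambda x: -0.5 * (x - 3.0) ** 2, x0=0.0)
print(samples[1000:].mean())
\end{verbatim}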

All current Bayesian TLG formulations use MCMC-based inference methods \cite{Ahmed2011, Hong:2011du, Wang2013, Ahmed:2012vh, Li2013}. They are simple to implement and provide an intuitive way of exploring an unknown density. On the other hand, they tend to scale poorly with both dataset size and dimensionality \cite{wainwright2008graphical, Grimmer_2010}. NLP in general tends to involve large amounts of sparse, high-dimensional data. Furthermore, TLG is a summarisation task, and the value of summarisation grows with the size of the underlying data. Because of this, exploring additional inference methods is an important goal for further research.

One alternative to sampling-based methods is variational inference. In broad strokes, this method defines a family of densities capable of approximating the generating distribution of a given dataset, then performs an iterative optimisation process to find the distribution in this family that best matches our data. Variational inference has been shown to generate models of similar quality to MCMC methods several orders of magnitude faster \cite{wainwright2008graphical, Grimmer_2010}.
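As a rough illustration of the scalability argument, scikit-learn's LatentDirichletAllocation fits a (parametric) topic model by online variational Bayes in mini-batches, so memory and time scale gracefully with corpus size. This is not the hierarchical Dirichlet process used in the works above, but it shows the general shape of a variational topic-model fit. The corpus, vocabulary size and topic count below are illustrative choices; the first call downloads the public 20 Newsgroups dataset as a stand-in for a news stream.

\begin{verbatim}
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words counts for a standard public corpus.
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data
counts = CountVectorizer(max_features=5000, stop_words="english").fit_transform(docs)

# Online (mini-batch) variational Bayes: each pass touches the data in
# small batches rather than holding sufficient statistics for everything.
lda = LatentDirichletAllocation(
    n_components=20,          # parametric: topic count fixed, unlike an HDP
    learning_method="online",
    batch_size=256,
    max_iter=5,
    random_state=0,
)
doc_topics = lda.fit_transform(counts)
print(doc_topics.shape)       # (n_documents, 20) topic mixtures
\end{verbatim}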

No implementations of TLG to date use this form of inference. Variational methods are highly specific to the underlying model, and developing a new variational inference formulation is an involved task. Nevertheless, they could provide a large increase in scalability. For the sake of comparison, consider the work of Wang et al. and Bryant et al. \cite{Wang2011, Bryant2012}, who analyse the performance of variational inference methods for hierarchical Dirichlet processes. This process is the foundation of our nonparametric topic models, justifying the comparison. Wang et al. employ variational methods on datasets of over 400,000 articles \cite{Wang2011}. In contrast, the largest TLG model with MCMC inference to date has used a dataset of only 10,000 articles \cite{Wang2013}.