Xavier Holt edited Bayesian_Topic_Models_Bayesian_Formulation__.md  almost 8 years ago

Commit id: c97f593d64017b733fe0073a7b52207181ec8d99

## Bayesian Topic Models

### Bayesian Formulation

As the Bayesian formulation of TLG is not the primary focus of this paper, we adopt the formulation of Ahmed et al. \cite{Ahmed2011}. It is a general framework which has been applied to a range of domains. It defines a topic model over the sentences and words in each document. The time-dependent hierarchical Dirichlet process (t-HDP) underlying the model is non-parametric in the number of topics.

### Topic Models

Topics are represented as a distribution over the words in the corpus. A document in turn is represented as a mixture of these topics. In the abstract, the story 'Robert Downey Junior stars in the Avengers' might be comprised of 50% topic 'Robert Downey Junior' and 50% topic 'Avengers'.

### Parametric and Non-parametric Approaches

A non-parametric Bayesian model is characterised by the property that one or more of its parameters are determined by the data. Contrastingly, a parametric model is defined by the absence of this property. For example, in a parametric \(k\)-means clustering algorithm, \(k\), the number of clusters, would be manually specified. A non-parametric \(k\)-means, on the other hand, would infer the number of clusters from the data itself. The hierarchical topic models implementing TLG can be classified by two levels of parametricism. Ahmed et al. and Hong et al. used a hierarchical LDA model \cite{Ahmed2011, Hong:2011du}.
Both the number of topics and the depth of the topic-tree are parameters that have to be manually specified. On the other hand, Ahmed et al., Li et al. and Wang all use models with a nonparametric number of topics \cite{Wang2013, Ahmed:2012vh, Li2013}. Ahmed et al.'s and Wang's models have a nonparametric tree-depth; Li et al.'s does not.

The choice of parametrisation is a tradeoff. Nonparametric models do not require specifying data-dependent parameters, and therefore need less dataset-dependent tuning. This makes a nonparametric model more general-purpose. The number of stories present in our corpus should also increase as a function of time, which makes the use of a parametric model a little awkward. We could base the number of topics on the end-state of the model. However, this would result in extra overhead in the early epochs of our training, when the true number of topics is small. Furthermore, this all assumes we have a 'final epoch' at all: in the streaming version of the TLG problem, it is not apparent what fixed number would constitute a reasonable number of topics.

The additional flexibility of nonparametric models comes at the expense of more work in the inference phase. When we replace fixed parameters with densities, we increase our model dimensionality. This is mitigated on a few counts. In the parametric case our parameters still have to be determined somehow, generally through cross-validation or a parameter-optimisation process. This overhead could easily outweigh the extra time at inference in the non-parametric case, especially as in most cases we only replace a handful of parameters.

## Components

**Clustering**: Our model handles both current and historical articles. Several domains where the TLG model is desirable (e.g. news) are characterised by a large historical catalogue and high frequency. Our clustering model has to be scalable and ideally functional in a streaming context.

**Summarisation**: While clustering and topic modelling are useful structuring tools, we must still compress the data for human readability.
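As a concrete illustration of the nonparametric behaviour discussed above, the sketch below simulates a Chinese restaurant process, the sequential view of a Dirichlet process: each item joins an existing cluster with probability proportional to that cluster's size, or opens a new cluster with probability proportional to a concentration parameter. The number of clusters is not fixed in advance but grows with the data (on average as \(O(\alpha \log n)\)), which matches the streaming setting where the number of stories grows over time. This is a minimal sketch; the name `crp_partition` and the parameter `alpha` are illustrative, not drawn from the cited models.

```python
import random

def crp_partition(n_items, alpha, seed=0):
    """Simulate a Chinese restaurant process: each item joins an
    existing cluster with probability proportional to its size, or
    starts a new cluster with probability proportional to alpha."""
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n_items):
        # Unnormalised weights: one per existing cluster, plus one
        # for opening a new cluster.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
    return counts

# The number of clusters is inferred as the data grows,
# rather than being fixed up front.
for n in (10, 100, 1000):
    print(n, len(crp_partition(n, alpha=1.0)))
```

In the parametric analogue, the loop over cluster assignments would draw from exactly \(k\) pre-specified clusters; here the extra `alpha` slot is what makes the model open-ended.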