Critique of Mining Topics in Documents: Standing on the Shoulders of Big Data
In this paper I review and evaluate the work of Zhiyuan Chen and Bing Liu of the University of Illinois at Chicago, entitled Mining Topics in Documents: Standing on the Shoulders of Big Data, presented at the 2014 ACM SIGKDD Conference on Knowledge Discovery and Data Mining. After a brief summary of the paper and its findings, I present the authors' backgrounds and related previous work - finding that they have pioneered transfer-learning models for topic mining. Subsequently I give a brief overview of related work and possible improvements, followed by my personal reflections and suggested pathways for future work. I finish with a brief conclusion.
Keywords: topic modeling, lifelong learning, transfer learning, AMC, LDA
In this paper I will be reviewing and evaluating the work of Zhiyuan Chen and Bing Liu of the University of Illinois at Chicago, entitled Mining Topics in Documents: Standing on the Shoulders of Big Data (MTD), presented at the 2014 ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Chen 2014).
MTD is a methodological paper that aims to improve and extend the field of topic modeling (i.e. automatically discovering and clustering the topics in an existing text snippet, called a document - the typical outcome is a set of topics, each uniquely defined and described by a distribution of words). The main motivation for the work is that the currently established method for topic modeling, Latent Dirichlet Allocation (LDA, commonly estimated with Gibbs sampling (Blei 2003)), requires a substantial amount of data to yield good results. LDA also needs to be fed the number of topics to find in a given text in advance. MTD proposes a "lifelong learning model", akin to "learning as humans do": it records existing knowledge about topic and word prevalence in known sets of documents and connects new documents, their topics, and their corresponding words to this knowledge. This is compared to the "dos and don'ts" of human learning: associating certain tasks with their context and determining whether it is socially or physically acceptable to perform them. In MTD this appears as a list of must-link word pairs that always appear in the description of the same topic, and as another list of cannot-link word pairs that never appear in the description of the same topic. Based on this logic, MTD names its algorithm AMC - topic modeling with Automatically generated Must-links and Cannot-links. Experimental results in the paper show that AMC outperforms the current best implementations of LDA.
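To make the must-link/cannot-link idea concrete, the sketch below mines both kinds of word pairs from a collection of prior topics, each represented as a set of its top words. This is an illustrative simplification, not the authors' procedure (AMC additionally filters candidate links and incorporates them through a Pólya-urn-based sampler); the function name `mine_links` and the threshold parameter are my own.

```python
from itertools import combinations

def mine_links(prior_topics, must_threshold=2):
    """Toy extraction of must-link / cannot-link word pairs.

    prior_topics: list of topics, each a set of that topic's top words.
    A must-link is a pair sharing a topic in at least `must_threshold`
    prior topics; a cannot-link is a pair never sharing any prior topic.
    """
    vocab = set().union(*prior_topics)
    cooccur = {}
    for topic in prior_topics:
        # count how many prior topics each word pair appears in together
        for pair in combinations(sorted(topic), 2):
            cooccur[pair] = cooccur.get(pair, 0) + 1
    must = {p for p, c in cooccur.items() if c >= must_threshold}
    cannot = {p for p in combinations(sorted(vocab), 2) if p not in cooccur}
    return must, cannot
```

On toy prior topics such as `[{"price", "battery"}, {"price", "battery", "screen"}, {"plot", "character"}]`, the pair ("battery", "price") comes out as a must-link, while pairs spanning the gadget and book domains, such as ("battery", "character"), come out as cannot-links.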
Classic topic models, such as LDA or PLSA (Probabilistic Latent Semantic Analysis), need thousands of documents to provide reliable topic information. But in practice, the number of documents available for analysis is often on the order of 100 or fewer - consider the comments or reviews for a single product, or the news articles on a single story. There are a few possible improvement pathways, but most are infeasible or impractical, such as increasing the number of input documents or providing human-encoded prior domain knowledge. Another improvement pathway is to transfer information across domains - as AMC does. This works because related domains share characteristics: reviews of any gadget will mention price or battery life, reviews of any book will mention length, and so on.
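To make the classic baseline concrete, the sketch below is a toy collapsed Gibbs sampler for LDA, showing in particular that the number of topics must be supplied up front. This is an illustrative sketch under simplified assumptions, not the implementation evaluated in the paper; the function name `lda_gibbs` and the default hyperparameters are my own.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.

    docs: list of documents, each a list of word tokens.
    num_topics: must be chosen in advance - LDA cannot infer it.
    Returns the top three words of each discovered topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    ndk = [[0] * num_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                                # topic totals
    z = []                                               # token assignments
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # resample proportional to P(topic | doc) * P(word | topic)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(num_topics)]
                r = rng.random() * sum(weights)
                new_k = num_topics - 1
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        new_k = t
                        break
                z[d][i] = new_k
                ndk[d][new_k] += 1; nkw[new_k][w] += 1; nk[new_k] += 1
    return [sorted(nkw[t], key=nkw[t].get, reverse=True)[:3]
            for t in range(num_topics)]
```

With only a handful of toy documents the sampler still returns `num_topics` word lists, but as the paper argues, at this data scale the discovered topics are unreliable - which is precisely the gap AMC's transferred knowledge is meant to fill.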
The AMC algorithm first runs LDA on a set of documents from a variety of domains. This yields a set of topics - with possibly (though not probably) overlapping topics across domains - labeled the prior topic set. Then the AMC must-link and cannot-link components are run on a new document set, using the information embedded in the distributions of the prior topic set. A must-link between two words means that they belong to the same topic across a number of prior topics. Cannot-links connect words which never occur together in a prior topic. The links are generated using a multi-dimensional generalized Pó