In this paper I review and evaluate the work of Zhiyuan Chen and Bing Liu of the University of Illinois at Chicago, entitled _Mining Topics in Documents: Standing on the Shoulders of Big Data_ (MTD), presented at the 2014 ACM SIGKDD Conference on Knowledge Discovery and Data Mining \cite{Chen_2014}. MTD is a methodological paper that aims to improve and extend the field of topic modeling (i.e. automatically discovering and clustering the topics in a text snippet, called a document; the typical outcome is a set of topics, each uniquely defined and described by a distribution of words).

The main motivation for their work is that the established method for topic modeling, Latent Dirichlet Allocation (LDA \cite{Blei:2003:LDA:944919.944937}, commonly implemented with Gibbs sampling), requires a substantial amount of data to yield good results. Moreover, LDA needs to be fed the number of topics to be found in a given text _in advance_. MTD proposes a "lifelong learning model", akin to "learning as humans do": it records existing knowledge about topic and word prevalence in known sets of documents and connects newly arriving documents, their topics, and their corresponding words to this knowledge. This is comparable to the "_do_s and _don't_s" of human learning, where we associate certain tasks with their context and determine whether it is socially or physically acceptable to perform them. In MTD this knowledge appears as a list of _must-link_ word pairs that always appear in the description of the same topic and another list of _cannot-link_ word pairs that never appear in the description of the same topic. Based on this logic, MTD names its algorithm _AMC_ (topic modeling with Automatically generated Must-links and Cannot-links). Experimental results in the paper show that AMC outperforms the current best implementations of LDA.
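To make the must-link / cannot-link idea concrete, here is a minimal Python sketch (my own illustration, not code from the paper); the example word pairs and the `link_type` helper are hypothetical, chosen only to show how such a knowledge base of word pairs could be stored and queried.

```python
# Hypothetical illustration of AMC-style prior knowledge (not the authors' code):
# unordered word pairs stored as frozensets, so (a, b) and (b, a) match equally.
must_links = {frozenset(p) for p in [("price", "cheap"), ("battery", "charge")]}
cannot_links = {frozenset(p) for p in [("battery", "cheap")]}

def link_type(w1: str, w2: str) -> str:
    """Classify a word pair against the accumulated knowledge base."""
    pair = frozenset((w1, w2))
    if pair in must_links:
        return "must-link"      # the two words should share a topic
    if pair in cannot_links:
        return "cannot-link"    # the two words should never share a topic
    return "unconstrained"      # no prior knowledge about this pair

print(link_type("cheap", "price"))    # -> must-link
print(link_type("cheap", "battery"))  # -> cannot-link
print(link_type("screen", "price"))   # -> unconstrained
```

In the actual AMC algorithm these constraints bias the sampler's topic assignments rather than acting as hard filters, but the lookup structure above captures the shape of the knowledge that is carried over from previously modeled document collections to new ones.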