Figure 2: The graphical model for the CTR model.
Given a collection, the posterior distribution (or maximum likelihood estimate) of the topics reveals the K topics that likely generated its documents. Unlike a clustering model, where each document is assigned to one cluster, LDA allows documents to exhibit multiple topics. For example, LDA can capture that one article might be about biology and statistics, while another might be about biology and physics. Since LDA is unsupervised, the themes of “physics,” “biology,” and “statistics” can be discovered from the corpus; the mixed-membership assumptions lead to sharper estimates of word co-occurrence patterns. Given a corpus of documents, we can use variational EM to learn the topics and decompose the documents according to them [7]. Further, given a new document, we can use variational inference to situate its content in terms of the topics. Our goal is to use topic modeling to give a content-based representation of items in a recommender system.
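To make this concrete, the following is a minimal sketch of obtaining such a content-based representation with an off-the-shelf LDA implementation (gensim, which fits LDA by online variational Bayes). The tokenized articles, the number of topics K, and all other settings are illustrative assumptions rather than the configuration used in this paper.

```python
from gensim import corpora
from gensim.models import LdaModel

# Hypothetical tokenized articles (title + abstract), already stop-word filtered.
docs = [["gene", "expression", "microarray", "cell"],
        ["graph", "community", "network", "clustering"],
        ["inference", "variational", "topic", "model"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

K = 2  # illustrative number of topics
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=K, passes=10)

# Topic proportions theta_j for a new, previously unseen article.
new_doc = dictionary.doc2bow(["network", "inference", "model"])
theta_new = lda.get_document_topics(new_doc, minimum_probability=0.0)
print(theta_new)  # e.g. [(0, 0.48), (1, 0.52)] -- the content-based item representation
```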
3. COLLABORATIVE TOPIC REGRESSION
In this section, we describe the collaborative topic regression (CTR) model. CTR combines traditional collaborative filtering with topic modeling.

A first approach to combining collaborative filtering and topic modeling is to fit a model that uses the latent topic space to explain both the observed ratings and the observed words. For example, we can use the topic proportions θ_j in place of the item latent vector v_j in Eq. 3,

r_ij ∼ N(u_i^T θ_j, c_ij^{-1}).   (5)

(We note that [19] proposed a similar approach to Eq. 5, but based on correlated topic models [4]. It showed modest improvement over matrix factorization on several movie recommendation datasets.) This model suffers from the limitation that it cannot distinguish topics that explain recommendations from topics that explain content. Consider two articles A and B that are both about machine learning applied to social networks. They are similar and, therefore, have similar topic proportions θ_A and θ_B. Now further suppose that these articles are interesting to different kinds of users: article A might give an interesting machine learning algorithm that is applied to social network applications; article B uses standard machine learning techniques, but gives an important piece of data analysis on social network data. Users who work in machine learning will prefer article A and rarely consider article B; users who work in social networks will prefer the opposite. However, using the topic proportions as in Eq. 5 will likely lead to similar recommendations for both articles to both types of users. Collaborative topic regression can detect this difference, namely that one type of user likes the first article and another type likes the second.

As above, collaborative topic regression (CTR) represents users with topic interests and assumes that documents are generated by a topic model. CTR additionally includes a latent variable ε_j that offsets the topic proportions θ_j when modeling the user ratings. As more users rate articles, we have a better idea of what this offset is. This offset variable can explain, for example, that article A is more interesting to machine learning researchers than it is to social network analysis researchers. How much of the prediction relies on content and how much relies on other users depends on how many users have rated the article.

Figure 2 shows the graphical model. Again, assume there are K topics β = β_{1:K}. The generative process of CTR is as follows:

1. For each user i, draw the user latent vector u_i ∼ N(0, λ_u^{-1} I_K).
2. For each item j,
   (a) Draw topic proportions θ_j ∼ Dirichlet(α).
   (b) Draw the item latent offset ε_j ∼ N(0, λ_v^{-1} I_K) and set the item latent vector as v_j = ε_j + θ_j.
   (c) For each word w_jn,
       i. Draw the topic assignment z_jn ∼ Mult(θ_j).
       ii. Draw the word w_jn ∼ Mult(β_{z_jn}).
3. For each user-item pair (i, j), draw the rating
   r_ij ∼ N(u_i^T v_j, c_ij^{-1}).   (6)

The key property of CTR lies in how the item latent vector v_j is generated. Note that v_j = ε_j + θ_j, where ε_j ∼ N(0, λ_v^{-1} I_K), is equivalent to v_j ∼ N(θ_j, λ_v^{-1} I_K): we assume the item latent vector v_j is close to the topic proportions θ_j, but can diverge from them if it has to. Note that the expectation of r_ij is a linear function of θ_j,

E[r_ij | u_i, θ_j, ε_j] = u_i^T (θ_j + ε_j).

This is why we call the model collaborative topic regression.
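As a concrete illustration, here is a minimal numpy sketch that simulates this generative process. All sizes and hyperparameter values are arbitrary choices for illustration, and the single confidence value c stands in for the per-pair confidences c_ij of Eq. 4.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: I users, J items, K topics, vocabulary of W words,
# N words per document; lambda_u and lambda_v are precision parameters.
I, J, K, W, N = 100, 50, 10, 1000, 80
lambda_u, lambda_v = 0.01, 100.0

beta = rng.dirichlet(np.full(W, 0.01), size=K)        # topics beta_{1:K}
u = rng.normal(0.0, lambda_u ** -0.5, size=(I, K))    # user latent vectors u_i
theta = rng.dirichlet(np.ones(K), size=J)             # topic proportions theta_j
eps = rng.normal(0.0, lambda_v ** -0.5, size=(J, K))  # item latent offsets eps_j
v = theta + eps                                       # item latent vectors v_j = theta_j + eps_j

# Words of each document: draw a topic assignment, then a word from that topic.
docs = [[rng.choice(W, p=beta[k]) for k in rng.choice(K, size=N, p=theta[j])]
        for j in range(J)]

# Ratings r_ij ~ N(u_i^T v_j, c_ij^{-1}); here every c_ij is fixed to 1 for brevity.
c = 1.0
r = rng.normal(u @ v.T, c ** -0.5)
```

In the sketch, an item's vector deviates from its topic proportions only through eps_j, mirroring the equivalence v_j ∼ N(θ_j, λ_v^{-1} I_K) noted above.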
Learning the parameters. Given the topic parameter β, computing the full posterior of u_i, v_j and θ_j is intractable. We develop an EM-style algorithm to learn the maximum a posteriori (MAP) estimates. Maximization of the posterior is equivalent to maximizing the complete log likelihood of U, V, θ_{1:J}, and R given λ_u, λ_v and β,

L = − (λ_u/2) Σ_i u_i^T u_i − (λ_v/2) Σ_j (v_j − θ_j)^T (v_j − θ_j) + Σ_j Σ_n log( Σ_k θ_jk β_{k,w_jn} ) − Σ_{i,j} (c_ij/2) (r_ij − u_i^T v_j)^2.   (7)

We have omitted a constant and set α = 1. We optimize this function by coordinate ascent, iteratively optimizing the collaborative filtering variables {u_i, v_j} and the topic proportions θ_j.

For u_i and v_j, maximization follows in a similar fashion as for basic matrix factorization [12]. Given the current estimate of θ_j, taking the gradient of L with respect to u_i and v_j and setting it to zero leads to (recall the matrix definitions U = (u_i)_{i=1}^I and V = (v_j)_{j=1}^J)

u_i ← (V C_i V^T + λ_u I_K)^{-1} V C_i R_i,   (8)
v_j ← (U C_j U^T + λ_v I_K)^{-1} (U C_j R_j + λ_v θ_j),   (9)

where C_i is a diagonal matrix with c_ij, j = 1, ..., J as its diagonal elements and R_i = (r_ij)_{j=1}^J for user i. For item j, C_j and R_j are similarly defined. Eq. 9 shows how the topic proportions θ_j affect the item latent vector v_j, where λ_v balances this effect. Finally, we note that the complexity is linear in the number of articles in the users' libraries. This follows from the special structure of c_ij defined in Eq. 4. (See [12] for details.)

Given U and V, we now describe how to learn the topic proportions θ_j. (On our data, we found that simply fixing θ_j as the estimate from vanilla LDA gives comparable performance and saves computation.) We first define q(z_jn = k) = φ_jnk. Then we separate the items that contain θ_j and apply Jensen's inequality,

L(θ_j) ≥ − (λ_v/2) (v_j − θ_j)^T (v_j − θ_j) + Σ_n Σ_k φ_jnk ( log θ_jk β_{k,w_jn} − log φ_jnk ) = L(θ_j, φ_j).   (10)

Let φ_j = (φ_jnk), n = 1, ..., N, k = 1, ..., K. The optimal φ_jnk satisfies

φ_jnk ∝ θ_jk β_{k,w_jn}.   (11)

L(θ_j, φ_j) gives a tight lower bound of L(θ_j). We cannot optimize θ_j analytically, so we use projection gradient [3]. We use coordinate ascent to optimize the remaining parameters, U, V, θ_{1:J} and φ_{1:J}. After we estimate U, V and φ, we can optimize β,

β_kw ∝ Σ_j Σ_n φ_jnk 1[w_jn = w].   (12)

Note this is the same M-step update for topics as in LDA [7].

Prediction. After all the (locally) optimal parameters U*, V*, θ*_{1:J} and β* are learned, the CTR model can be used for both in-matrix and out-of-matrix prediction. Let D be the observed data; in general, each prediction is estimated as

E[r_ij | D] ≈ E[u_i | D]^T ( E[θ_j | D] + E[ε_j | D] ).   (13)

For in-matrix prediction, we use the point estimates of u_i, θ_j and ε_j to approximate their expectations,

r*_ij ≈ (u*_i)^T (θ*_j + ε*_j) = (u*_i)^T v*_j,   (14)

where recall that v_j = θ_j + ε_j. In out-of-matrix prediction the article is new, and no other ratings are available. Thus, E[ε_j] = 0 and we predict with

r*_ij ≈ (u*_i)^T θ*_j.   (15)

To obtain the topic proportions θ*_j for a new article, we optimize Eq. 10. The first term is dropped because v_j = θ_j.
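The following is a minimal numpy sketch of the coordinate-ascent updates in Eqs. 8 and 9 and of the prediction rules in Eqs. 14 and 15. It uses a dense rating matrix and does not exploit the sparsity structure of c_ij that makes the exact implementation linear in the library sizes; the confidence values a, b and the regularization parameters are illustrative assumptions.

```python
import numpy as np

def update_cf_variables(U, V, theta, R, a=1.0, b=0.01, lambda_u=0.01, lambda_v=100.0):
    """One coordinate-ascent sweep over u_i (Eq. 8) and v_j (Eq. 9).

    U: K x I user matrix, V: K x J item matrix, theta: J x K topic proportions,
    R: I x J binary rating matrix. Confidences follow c_ij = a if r_ij = 1 and
    b otherwise (the structure referred to as Eq. 4); all values are illustrative.
    """
    K, I = U.shape
    J = V.shape[1]
    for i in range(I):                                        # Eq. 8
        C_i = np.where(R[i] > 0, a, b)                        # diagonal of C_i
        A = (V * C_i) @ V.T + lambda_u * np.eye(K)
        U[:, i] = np.linalg.solve(A, (V * C_i) @ R[i])
    for j in range(J):                                        # Eq. 9
        C_j = np.where(R[:, j] > 0, a, b)                     # diagonal of C_j
        A = (U * C_j) @ U.T + lambda_v * np.eye(K)
        V[:, j] = np.linalg.solve(A, (U * C_j) @ R[:, j] + lambda_v * theta[j])
    return U, V

def predict_in_matrix(U, V):
    """Eq. 14: r*_ij = u_i^T v_j for articles already in the matrix."""
    return U.T @ V

def predict_out_of_matrix(U, theta_new):
    """Eq. 15: r*_ij = u_i^T theta_j for a new article with proportions theta_new."""
    return U.T @ theta_new
```

In a full implementation, these sweeps would be alternated with the θ_j and β updates of Eqs. 10-12 until the objective in Eq. 7 converges.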
Related work. Several other works use content for recommendation [15, 14, 1, 2]. Among these, the closest to ours is fLDA [2]. fLDA generalizes the supervised topic model (sLDA) [6], using the empirical topic proportions z̄_j = (1/N) Σ_{n=1}^N z_jn as well as several other covariates to form predictions. In our setting, where we do not have additional covariates, their approach is roughly akin to setting v_j = θ_j. We show in Section 4 that a similar setting does not perform as well as the CTR model because it largely ignores the other users' ratings. Other recent work considers the related problem of using topic modeling to predict legislative votes [21, 10]. Neither of these methods introduces offset terms to account for votes (i.e., ratings). Legislative votes might be an interesting application for the CTR model.
4. EMPIRICAL STUDY
We demonstrate our model by analyzing a real-world community of researchers and their citation files. (A demo of the results can be found at http://www.cs.princeton.edu/~chongw/citeulike/.)
Dataset. Our data are users and their libraries of articles obtained from CiteULike (http://www.citeulike.org/faq/data). At CiteULike, registered users create personal reference libraries; each article usually has a title and abstract. (Other information about the articles, such as authors, publications and keywords, is not used in this paper.) We merged duplicate articles, removed empty articles, and removed users with fewer than 10 articles to obtain a data set of 5,551 users and 16,980 articles with 204,986 observed user-item pairs. (This matrix is highly sparse, with a sparsity of 99.8%.) On average, each user has 37 articles in the library, ranging from 10 to 403; 93% of the users have fewer than 100 articles. For each article, we concatenate its title and abstract. We remove stop words and use tf-idf to choose the top 8,000 distinct words as the vocabulary [5]. This yielded a corpus of 1.6M words. These articles were added to CiteULike between 2004 and 2010. On average, each article appears in 12 users' libraries, ranging from 1 to 321; 97% of the articles appear in fewer than 40 libraries.
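As an illustration of the vocabulary construction, here is a small sketch using scikit-learn. The paper does not specify exactly how words are ranked by tf-idf; ranking each word by its maximum tf-idf score across articles is one plausible choice, and the tiny `texts` list stands in for the real title-plus-abstract strings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# One "title + abstract" string per article; a hypothetical stand-in corpus.
texts = ["collaborative topic regression recommends scientific articles",
         "latent dirichlet allocation discovers topics in document collections"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(texts)                 # articles x full vocabulary
terms = tfidf.get_feature_names_out()

scores = X.max(axis=0).toarray().ravel()       # best tf-idf score per word
top = np.argsort(scores)[::-1][:8000]          # keep the top 8,000 words
vocabulary = set(terms[top])
```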
Evaluation. In our experiments, we will analyze a set of articles and user libraries. We will evaluate recommendation algorithms on sets of held-out articles and ratings. We will (hypothetically) present each user with M articles sorted by their predicted rating and evaluate based on which of these articles were actually in each user's library. Two possible metrics are precision and recall. However, as we discussed earlier, zero ratings are uncertain: they may indicate that a user does not like an article or simply does not know about it. This makes it difficult to accurately compute precision. Rather, since ratings of r_ij = 1 are known to be true positives, we focus on recall. Recall only considers the positively rated articles within the top M; a system that achieves high recall with a lower M is a better system. For each user, recall@M is defined as

recall@M = (number of articles the user likes in the top M) / (total number of articles the user likes).

The recall for the entire system is summarized using the average recall over all users. The recall defined above is user oriented. We also consider article-oriented recall for testing the system's predictive performance on a particular article. For article j, we consider the population of users who like the article and compute the proportion of those users for whom the article appears in their top M recommended articles. This evaluates the predictive power of the system on a chosen set of articles. As discussed in Section 2, we consider two recommendation tasks: in-matrix prediction and out-of-matrix prediction.
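A minimal sketch of computing recall@M, assuming for each user a vector of predicted scores over all candidate articles and a boolean indicator of which articles are actually in that user's library (both hypothetical inputs):

```python
import numpy as np

def recall_at_m(scores, liked, M):
    """recall@M for one user: fraction of the user's liked articles in the top-M list."""
    top = np.argsort(scores)[::-1][:M]      # indices of the M highest-scored articles
    return liked[top].sum() / liked.sum()   # hits in top M / total liked

# System-level (user-oriented) recall: average over users.
# avg_recall = np.mean([recall_at_m(s, l, M) for s, l in zip(all_scores, all_liked)])
```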
In-matrix prediction. In-matrix prediction considers the case where each user has a set of articles that she has not seen, but that at least one other user has seen. We ask: how good is each system at rating that set of articles for each user? As discussed in Section 2.1, this task is similar to traditional collaborative filtering. We split the data into a training set and a test set, ensuring that all articles in the test set have appeared in the training set. Content information is not required to perform recommendations (though we will see that it helps), and thus matrix factorization can be used. We use 5-fold cross-validation. For every article that appears at least 5 times in the users' libraries, we evenly split its user-item pairs (both 1's and 0's) into 5 folds. We iteratively consider each fold to be the test set and the others to be the training set. Articles that appear fewer than 5 times are always placed in the training set. This guarantees that all articles in the test set also appear in the training set. (9% of the articles are always in the training set, since they appear fewer than 5 times.) For each fold, we fit a model to the training set and test on the within-fold articles for each user. (Note that each user has a different set of within-fold articles.) We form predictive ratings for the test set, and generate a list of the top M recommended articles.
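A sketch of this in-matrix fold construction, assuming a dense binary rating matrix R for illustration (the real data would be stored sparsely):

```python
import numpy as np

def in_matrix_folds(R, n_folds=5, seed=0):
    """Fold assignment for the in-matrix protocol.

    R: I x J binary rating matrix (illustrative). Returns an I x J array of
    fold indices; -1 marks pairs that always stay in the training set.
    """
    rng = np.random.default_rng(seed)
    I, J = R.shape
    fold = np.full((I, J), -1, dtype=int)
    for j in range(J):
        if R[:, j].sum() < n_folds:               # article appears in < 5 libraries:
            continue                              # all of its pairs stay in training
        users = rng.permutation(I)                # shuffle this article's user-item pairs
        fold[users, j] = np.arange(I) % n_folds   # split them evenly over the folds
    return fold

# Training set for fold f: pairs with fold != f; test set: pairs with fold == f.
```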
Out-of-matrix prediction. Out-of-matrix prediction considers the case where a new set of articles is published and no one has seen them. Again we ask: how good is each system at rating that set of articles for each user? We again use 5-fold cross-validation. First, we evenly group all articles into 5 folds. For each fold, we fit the model to the submatrix formed by the out-of-fold articles and then test the recommendations for each user on the within-fold articles. Note that in this case, each user has the same set of within-fold articles and we are guaranteed that none of these articles is in the training set for any user. Again, we form predictive ratings for the test set and generate a list of the top M recommended articles. These two experimental set-ups, in-matrix and out-of-matrix prediction, are designed to be comparable: the top M articles are computed from candidate populations of the same size.
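For completeness, a corresponding sketch of the out-of-matrix protocol; all names are illustrative:

```python
import numpy as np

def out_of_matrix_folds(n_articles, n_folds=5, seed=0):
    """Randomly assign whole articles to folds for the out-of-matrix protocol."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n_articles) % n_folds

# For each fold f: fit the model on the submatrix of articles with fold != f,
# infer theta_j for the held-out articles from their text alone (Eq. 10 with the
# first term dropped), and rank them for every user by u_i^T theta_j (Eq. 15).
```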