ABSTRACT
Researchers have access to large online archives of scientific articles. As a consequence, finding relevant papers has become more difficult. Newly formed online communities of researchers sharing citations provide a new way to solve this problem. In this paper, we develop an algorithm to recommend scientific articles to users of an online community. Our approach combines the merits of traditional collaborative filtering and probabilistic topic modeling. It provides an interpretable latent structure for users and items, and can form recommendations about both existing and newly published articles. We study a large subset of data from CiteULike, a bibliography sharing service, and show that our algorithm provides a more effective recommender system than traditional collaborative filtering.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - information filtering; I.2.6 [Artificial Intelligence]: Learning - parameter learning
General Terms
Algorithms, Experimentation, Performance
Keywords
Scientific article recommendation, Topic modeling, Collaborative filtering, Latent structure interpretation
1. INTRODUCTION
Modern researchers have access to large archives of scientific articles. These archives are growing as new articles are placed online and old articles are scanned and indexed. While this growth has allowed researchers to quickly access more scientific information, it has also made it more difficult for them to find articles relevant to their interests. Modern researchers need new tools for managing what is available to them.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD’11, August 21–24, 2011, San Diego, California, USA. Copyright 2011 ACM 978-1-4503-0813-7/11/08 ...$10.00.

Historically, one way that researchers find articles is by following citations in other articles that they are interested in. This is an effective practice, and one that we should continue, but it limits researchers to specific citation communities, and it is biased towards heavily cited papers. A statistician may miss a relevant paper in economics or biology because the two literatures rarely cite each other; and she may miss a relevant paper in statistics because it was also missed by the authors of the papers that she has read. One of the opportunities of online archives is to inform researchers about literature that they might not be aware of.

A complementary method of finding articles is keyword search. This is a powerful approach, but it is also limited.
Forming queries for finding new scientific articles can be difficult as a researcher may not know what to look for; search is mainly based on content, while good articles are also those that many others found valuable; and search is only good for directed exploration, while many researchers would also like a “feed” of new and interesting articles.

Recently, websites like CiteULike¹ and Mendeley² have allowed researchers to create their own reference libraries for the articles they are interested in and share them with other researchers. This has opened the door to using recommendation methods [13] as a third way to help researchers find interesting articles.

In this paper, we develop an algorithm for recommending scientific articles to users of online archives. Each user has a library of articles that he or she is interested in, and our goal is to match each user to articles of interest that are not in his or her library.

We have several criteria for an algorithm to recommend scientific articles. First, recommending older articles is important. Users of scientific archives are interested in older articles for learning about new fields and understanding the foundations of their fields. When recommending old articles, the opinions of other users play a role. A foundational article will be in many users’ libraries; a less important article will be in few.

Second, recommending new articles is also important. For example, when a conference publishes its proceedings, users would like to see recommendations from these new articles to keep up with the state of the art in their discipline. Since the articles are new, there is little information about which or how many other users placed the articles in their libraries, and thus traditional collaborative filtering methods have difficulty making recommendations. With new articles, a recommendation system must use their content.

Finally, exploratory variables can be valuable in online scientific archives and communities.
For example, we can summarize and describe each user’s preference profile based on the content of the articles that he or she likes. This lets us connect similar users to enhance the community, and indicate why we are connecting them. Further, we can describe articles in terms of what kinds of users like them. For example, we might detect that a machine learning article is of strong interest to computer vision researchers. If enough researchers use such services, these variables might also give an alternative measure of the impact of an article within a field.

¹ http://www.citeulike.org
² http://www.mendeley.com

With these criteria in mind, we develop a machine learning algorithm for recommending scientific articles to users in an online scientific community. Our algorithm uses two types of data, the other users’ libraries and the content of the articles, to form its recommendations. For each user, our algorithm can find both older papers that are important to other similar users and newly written papers whose content reflects the user’s specific interests. Finally, our algorithm gives interpretable representations of users and articles.

Our approach combines ideas from collaborative filtering based on latent factor models [17, 18, 13, 1, 22] and content analysis based on probabilistic topic modeling [7, 8, 20, 2]. Like latent factor models, our algorithm uses information from other users’ libraries. For a particular user, it can recommend articles from other users who liked similar articles. Latent factor models work well for recommending known articles, but cannot generalize to previously unseen articles.

To generalize to unseen articles, our algorithm uses topic modeling. Topic modeling provides a representation of the articles in terms of latent themes discovered from the collection. When used in our recommender system, this component can recommend articles that have similar content to other articles that a user likes. The topic representation of articles allows the algorithm to make meaningful recommendations about articles before anyone has rated them.

We combine these approaches in a probabilistic model, where making a recommendation for a particular user is akin to computing a conditional expectation of hidden variables.
We will show how the algorithm for computing these expectations naturally balances the influence of the content of the articles and the libraries of the other users. An article that has not been seen by many will be recommended based more on its content; an article that has been widely seen will be recommended based more on the other users.

We studied our algorithm with data from CiteULike: 5,551 users, 16,980 articles, and 204,986 bibliography entries. We will demonstrate that combining content-based and collaborative-based methods works well for recommending scientific articles. Our method provides better performance than matrix factorization methods alone, indicating that content can improve recommendation systems. Further, while traditional collaborative filtering cannot suggest articles before anyone has rated them, our method can use the content of new articles to make predictions about who will like them.
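To give a sense of scale, the counts reported above imply a very sparse user-by-article matrix. The following sketch only does the arithmetic on those three numbers; the variable names are ours, for illustration, and nothing here is part of the recommendation algorithm itself.

```python
# Counts reported for the CiteULike subset studied in this paper.
n_users = 5_551
n_articles = 16_980
n_entries = 204_986  # user-article pairs that appear in some library

# Fraction of all possible user-article pairs that are observed entries.
density = n_entries / (n_users * n_articles)

# Average library size per user and average number of libraries per article.
avg_per_user = n_entries / n_users
avg_per_article = n_entries / n_articles

print(f"density: {density:.4%}")
print(f"average articles per user: {avg_per_user:.1f}")
print(f"average users per article: {avg_per_article:.1f}")
```

With these counts, roughly 0.2% of the possible user-article pairs are library entries, which works out to about 37 articles per user and about 12 users per article on average.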
2. BACKGROUND
We first give some background. We describe two types of recommendation problems we address; we describe the classical matrix factorization solution to recommendation; and we review latent Dirichlet allocation (LDA) for topic modeling of text corpora.
2.1 Recommendation Tasks
The two elements in a recommender system are users and items. In our problem, items are scientific articles and users are researchers. We will assume I users and J items. The rating variable r_ij ∈ {0, 1} denotes whether user i includes article j in her library [12]. If it is in the library, this means that user i is interested in article j. (This differs from some other systems where users explicitly rate items on a scale.) Note that r_ij = 0 can be interpreted in two ways. One way is that user i is not interested in article j; the other is that user i does not know about article j.

For each user, our task is to recommend articles that are not in her library but are potentially interesting. There are two types of recommendation: in-matrix prediction and out-of-matrix prediction. Figure 1 illustrates the idea.

Figure 1: Illustration of the two tasks for scientific article recommendation systems, where √ indicates “like”, × “dislike” and ? “unknown”.

In-matrix prediction. Figure 1 (a) illustrates in-matrix prediction. This refers to the problem of making recommendations about those articles that have been rated by at least one user in the system. This is the task that traditional collaborative filtering can address.

Out-of-matrix prediction. Figure 1 (b) illustrates out-of-matrix prediction, where articles 4 and 5 have never been rated. (This is sometimes called “cold start recommendation.”) Traditional collaborative filtering algorithms cannot make predictions about these articles because those algorithms only use information about other users’ ratings. This task is important for online scientific archives, however, because users want to see new articles in their fields. A recommender system that cannot handle out-of-matrix prediction cannot recommend newly published papers to its users.
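The distinction between the two tasks can be made concrete with a small sketch. It assumes the libraries are stored as a binary NumPy array with one row per user and one column per article; the function names and the toy matrix are hypothetical and chosen only for illustration (the toy data is not the data shown in Figure 1).

```python
import numpy as np

def split_articles(R):
    """Split article indices by whether any user has the article in a library.

    R is a binary (users x articles) array with R[i, j] = 1 when user i
    has article j in her library. Returns (in_matrix, out_of_matrix).
    """
    R = np.asarray(R)
    library_counts = R.sum(axis=0)            # number of libraries per article
    in_matrix = np.flatnonzero(library_counts > 0)
    out_of_matrix = np.flatnonzero(library_counts == 0)
    return in_matrix, out_of_matrix

def candidates_for_user(R, i):
    """Articles not in user i's library, i.e. the ones we may recommend."""
    R = np.asarray(R)
    return np.flatnonzero(R[i] == 0)

# Toy data: 3 users, 5 articles. The last two columns are in no library,
# so they can only be reached by out-of-matrix prediction.
R = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
])

in_matrix, out_of_matrix = split_articles(R)
user0_candidates = candidates_for_user(R, 0)
```

Note that a zero in this array does not distinguish "not interested" from "unaware of the article"; the sketch treats every zero as a recommendation candidate, which is the ambiguity described above.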