POSTPRINT authorea.com/149

Abstract

This article was published as How the Scientific Community Reacts to Newly Submitted Preprints: Article Downloads, Twitter Mentions and Citations. Xin Shuai, Alberto Pepe, Johan Bollen. PLoS ONE 7(11): e47523. doi:10.1371/journal.pone.0047523. Open Access Article.

# Introduction

The view from the “ivory tower” is that scholars make rational, expert decisions on what to publish, what to read and what to cite. In fact, the use of citation statistics to assess scholarly impact is to a large degree premised on the very notion that citation data represent an explicit, objective expression of impact by expert authors (Rubin 2010). Yet scholarship is increasingly conducted online, and social media are becoming an important part of the online scholarly ecology. As a result, the citation behavior of scholars may be affected by their growing use of social media. Practices and considerations that go beyond traditional notions of scholarly impact may thus influence what scholars cite.

Recent efforts have investigated the effect of social media environments on scholarly practice. For example, some research has looked at how scientists use the microblogging platform Twitter during conferences by analyzing tweets containing conference hashtags (Letierce 2010, Weller 2011). Other research has explored the ways in which scholars use Twitter and related platforms to cite scientific articles (Priem 2010, Weller 2011a). More recent work has shown that Twitter article mentions predict future citations (Eysenbach 2011). This article falls within, and extends, these lines of research by examining the temporal relations between quantitative measures of readership, Twitter mentions, and subsequent citations for a cohort of scientific preprints.

We study how the scientific community and the public at large respond to a cohort of preprints that were submitted to the arXiv database (http://arxiv.org), a service managed by Cornell University Library, which has become the premier preprint publishing platform in physics, computer science, astronomy, and related domains. We examine the relations between three types of responses to the submission of these preprints: the number of Twitter posts (tweets) that specifically mention them, the number of times they were downloaded from the arXiv.org web site, and the number of early citations that the 70 most Twitter-mentioned preprints in our cohort received after their submission. In each case, we measure the total volume of responses, as well as the delay and span of their temporal distribution. We perform a comparative analysis of how these indicators are related to each other, both in magnitude and time.

Our results indicate that download and social media responses follow distinct temporal patterns. Moreover, we observe a statistically significant correlation between social media mentions and both download and citation counts. These results are highly relevant to recent investigations of scholarly impact based on social media data (Priem 2010a, Priem 2011) as well as to more traditional efforts to enhance the assessment of scholarly impact from usage data (Bollen 2009, Bollen 2008, Brody 2006, Kurtz 2010).

# Data and study overview

## Data collection

Our analysis is based on a corpus of 4,606 scientific articles submitted to the preprint database arXiv between October 4, 2010 and May 2, 2011. For each article in this cohort, we gathered its downloads from the arXiv server's weekly download logs, its daily number of mentions on Twitter from a large-scale collection of Twitter data gathered over that period, and its early citations in the scholarly record from Google Scholar. Table 1 summarizes the data collection and Figure 1 provides an overview of the data collection timelines.

The datasets employed in this study are:

• ArXiv downloads: For each article in the aforementioned cohort we retrieved its weekly download numbers from the arXiv logs for the period from October 4, 2010 to May 9, 2011. A total of 2,904,816 downloads were recorded for the 4,606 articles.

• Twitter mentions: Our collection of tweets is based on the Gardenhose, a data feed that returns a randomly sampled 10% of all daily tweets. A Twitter mention of an arXiv article was deemed to have occurred when a tweet contained an explicit or shortened link to an arXiv paper (see “Materials” appendix for more details). Between October 4, 2010 and May 9, 2011 we scanned 1,959,654,862 tweets in which 4,415 articles out of 4,606 in our cohort were mentioned at least once, i.e. approximately 95% of the cohort. Such wide coverage of arXiv articles is mostly due to specialized bot accounts that post arXiv submissions daily. The volume of Twitter mentions of arXiv papers was very small compared to the total volume of tweets in that period, with only 5,752 tweets containing mentions of papers in the arXiv corpus. We found that 2,800 of these 5,752 tweets came from non-bot accounts. After filtering out all tweets posted by bot accounts, we retained 1,710 arXiv articles out of 4,415 that were mentioned on Twitter by non-bot accounts. Whether bot mentions are included or excluded, the distribution of the number of tweets per paper was highly skewed; most papers were mentioned only once, but one paper in the corpus was mentioned as many as 113 times.

• Early citations: We manually retrieved citation counts from Google Scholar for the 70 most Twitter-mentioned articles in our cohort. Citation counts were retrieved on September 30, 2011 and date back to the initial submission date in arXiv. All 70 articles combined were cited a total of 431 times at that point. The most cited article in the corpus was cited 62 times whereas most articles received hardly any citations.
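The Twitter-mention criterion described in the list above (a tweet counts as a mention when it carries a link to an arXiv paper) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the regular expression, the function names `find_arxiv_ids` and `count_mentions`, and the sample tweets are all hypothetical. Shortened links would first have to be expanded to their target URL, a step that is omitted here.

```python
import re

# Assumed pattern: explicit arxiv.org/abs/ or arxiv.org/pdf/ links carrying a
# new-style identifier (YYMM.NNNN or YYMM.NNNNN) with an optional version suffix.
ARXIV_LINK = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?",
    re.IGNORECASE,
)


def find_arxiv_ids(tweet_text):
    """Return the set of arXiv identifiers explicitly linked in one tweet."""
    return set(ARXIV_LINK.findall(tweet_text))


def count_mentions(tweets):
    """Count, for each arXiv identifier, how many tweets link to it."""
    counts = {}
    for text in tweets:
        # A tweet that links the same paper twice still counts as one mention.
        for arxiv_id in find_arxiv_ids(text):
            counts[arxiv_id] = counts.get(arxiv_id, 0) + 1
    return counts


# Hypothetical sample tweets, for illustration only.
sample_tweets = [
    "Interesting preprint http://arxiv.org/abs/1010.1234",
    "Same paper, PDF link: http://arxiv.org/pdf/1010.1234v2",
    "No link here, just chatting about physics",
]
mention_counts = count_mentions(sample_tweets)
```

Separating bot from non-bot accounts, as done in the study, would require account-level information that this sketch does not model.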

Given the nature of our research topic, we focus on early responses to preprint submissions, i.e., immediate, swift reactions in the form of downloads, Twitter mentions, and citations. Therefore, we record download statistics and Twitter mention data only up to one week beyond the submission period itself (up to May 9, 2011).

As for citation data, we are aware that citations take years to accrue. We do not explore long-term citation effects here, but only the early, immediate response to preprint submission in the form of citations in the scholarly record. Our citation data pertain to a time period that spans from 5 months to 1 year, which is a fraction of the expected amount of “maturation time” for citation analysis. Citation data must therefore be considered to reflect “early citations” only, not total potential citations.
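The "5 months to 1 year" window follows directly from the dates given in the data description. The short check below uses only those dates (first and last submission dates of the cohort, and the citation retrieval date); the variable names are ours.

```python
from datetime import date

# Dates taken from the data description above.
first_submission = date(2010, 10, 4)   # start of the submission period
last_submission = date(2011, 5, 2)     # end of the submission period
citation_retrieval = date(2011, 9, 30)  # day citation counts were collected

# Time available for citations to accrue, for the earliest and latest submissions.
longest_window_days = (citation_retrieval - first_submission).days
shortest_window_days = (citation_retrieval - last_submission).days

print(longest_window_days, shortest_window_days)
```

The earliest submissions thus had roughly a year to accrue citations, and the latest roughly five months.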

## Definitions: delay and time span

Twitter mentions and arXiv downloads may follow particular temporal patterns. For example, for some articles downloads and mentions may take weeks to slowly increase after submission, whereas for other articles downloads may increase very swiftly after submission to wane very shortly thereafter. The total number of downloads and mentions is orthogonal to these temporal effects, and could be different in either case.
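To make the distinction between volume and timing concrete, the toy example below compares two invented weekly download series with the same total but very different temporal profiles. The numbers are hypothetical, and the "peak week" used here is only an illustrative proxy for timing; it is not the delay or time span measure used in this article.

```python
# Hypothetical weekly download counts, indexed by weeks since submission.
slow_riser = [5, 10, 20, 40, 25]   # interest builds up over several weeks
fast_burst = [80, 15, 3, 1, 1]     # interest peaks immediately, then wanes


def total_volume(series):
    """Total number of downloads over the observed weeks."""
    return sum(series)


def peak_week(series):
    """Index of the first week with the highest download count."""
    return series.index(max(series))


for name, series in [("slow_riser", slow_riser), ("fast_burst", fast_burst)]:
    print(name, total_volume(series), peak_week(series))
```

Both series sum to the same volume, yet their peaks fall in different weeks, which is why volume and temporal parameters need to be measured separately.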

The two parameters that we use to describe the temporal distributions of arXiv downloads and Twitter mentions are delay and the time span, which we define as follows. Let $$t_0 \in \mathbb{N}^+$$ be the date of submission for arti