Alberto Pepe edited subsectionData_collection_Our_analysis.tex  about 11 years ago

Commit id: eb39d7e017e98034de55ee64d6c7c7f0f102c407


\section{Data and study overview}

\subsection{Data collection}

Our analysis is based on a corpus of 4,606 scientific articles submitted to the preprint database arXiv between October 4, 2010 and May 2, 2011. For each article in this cohort, we gathered its weekly download counts from the arXiv server logs, its daily number of mentions on Twitter from a large-scale collection of tweets gathered over that period, and its early citations in the scholarly record from Google Scholar. Table 1 summarizes the data collection and Figure 1 provides an overview of the data collection timelines.

The datasets employed in this study are:

\begin{itemize}
\item \textbf{ArXiv downloads}: For each article in the cohort, we retrieved its weekly download numbers from the arXiv logs for the period from October 4, 2010 to May 9, 2011. A total of 2,904,816 downloads were recorded for the 4,606 articles.
%
\item \textbf{Twitter mentions}: Our collection of tweets is based on the Gardenhose, a data feed that returns a randomly sampled 10\% of all daily tweets. A Twitter mention of an arXiv article was deemed to have occurred when a tweet contained an explicit or shortened link to an arXiv paper (see the ``Materials'' appendix for more details). Between October 4, 2010 and May 9, 2011 we scanned 1,959,654,862 tweets, in which 4,415 of the 4,606 articles in our cohort were mentioned at least once, i.e., approximately 95\% of the cohort. Such wide coverage of arXiv articles is mostly due to specialized bot accounts that post arXiv submissions daily. The volume of Twitter mentions of arXiv papers was very small compared to the total volume of tweets in the period, with only 5,752 tweets containing mentions of papers in the arXiv corpus. We found that 2,800 of these 5,752 tweets were from non-bot accounts.
After filtering out all tweets posted by bot accounts, 1,710 of the 4,415 mentioned arXiv articles remain, namely those mentioned on Twitter by non-bot accounts. Whether bot mentions are included or excluded, the distribution of the number of tweets over all papers was very skewed: most papers were mentioned only once, but one paper in the corpus was mentioned as many as 113 times.
\item \textbf{Early citations}: We manually retrieved citation counts from Google Scholar for the 70 most Twitter-mentioned articles in our cohort. Citation counts were retrieved on September 30, 2011 and cover the period since each article's initial submission to arXiv. The 70 articles combined had been cited a total of 431 times at that point. The most cited article in the corpus was cited 62 times, whereas most articles received hardly any citations.
\end{itemize}

Given our research topic, we focus on \textit{early} responses to preprint submissions, i.e., immediate, swift reactions in the form of downloads, Twitter mentions, and citations. We therefore record download statistics and Twitter mention data only up to one week after the end of the submission period (that is, up to May 9, 2011).

As for citation data, we are aware that citations take years to accrue. We do not explore long-term citation effects here, but only the early, immediate response to preprint submission in the form of citations in the scholarly record. Depending on an article's submission date, our citation data covers a period of between 5 months and 1 year, which is only a fraction of the ``maturation time'' usually expected for citation analysis. Citation data must therefore be considered to reflect ``early citations'' only, not total potential citations.
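
The observation windows and coverage figures quoted above can be checked with a short calculation. The following Python sketch is illustrative only and is not part of the original analysis pipeline: all variable and function names are our own assumptions, and the only inputs are the dates and counts stated in this section.

```python
from datetime import date

# Dates quoted in the text (names are illustrative, not from the study's code).
FIRST_SUBMISSION = date(2010, 10, 4)
LAST_SUBMISSION = date(2011, 5, 2)
TRACKING_END = date(2011, 5, 9)         # end of download and Twitter records
CITATION_SNAPSHOT = date(2011, 9, 30)   # Google Scholar retrieval date


def days_between(start, end):
    """Return the number of days from start to end."""
    return (end - start).days


# Citation window: shortest for the last-submitted article,
# longest for the first-submitted article.
shortest_citation_window = days_between(LAST_SUBMISSION, CITATION_SNAPSHOT)
longest_citation_window = days_between(FIRST_SUBMISSION, CITATION_SNAPSHOT)

# Download/Twitter window available to the last-submitted article.
shortest_tracking_window = days_between(LAST_SUBMISSION, TRACKING_END)

# Counts quoted in the text.
COHORT_SIZE = 4606
MENTIONED_ARTICLES = 4415
MENTION_TWEETS = 5752
NON_BOT_TWEETS = 2800

# Fraction of the cohort mentioned at least once, and fraction of
# mention tweets that came from non-bot accounts.
mentioned_share = MENTIONED_ARTICLES / COHORT_SIZE
non_bot_share = NON_BOT_TWEETS / MENTION_TWEETS

if __name__ == "__main__":
    print("Citation window (days):",
          shortest_citation_window, "to", longest_citation_window)
    print("Shortest tracking window (days):", shortest_tracking_window)
    print("Share of cohort mentioned: {:.1%}".format(mentioned_share))
    print("Share of non-bot mention tweets: {:.1%}".format(non_bot_share))
```

Under these assumptions the citation window ranges from 151 to 361 days, which matches the stated span of roughly 5 months to 1 year, and the last-submitted articles have a 7-day window of download and Twitter data.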