Alberto Pepe edited subsectionRegression_between_article_downloads.tex about 11 years ago

Commit id: ff250a643c4aeb697de271026d326b23f639ec54

\subsection{Regression between article downloads, Twitter mentions, and citations}
We investigate the degree to which article citations, denoted $C$, can be explained in terms of article-based Twitter mentions, denoted $T$, and arXiv downloads, denoted $A$, by means of a multivariate linear regression analysis. This analysis is limited to a cohort of the 70 most mentioned articles on Twitter that were submitted to arXiv.org from October 4, 2010 to March 1, 2011 (5 months). This limitation is due to the extent of work involved in manually collecting early citation data, as well as to the fact that a cohort of articles submitted earlier in the timeline can provide a fuller coverage of Twitter mentions and arXiv downloads. For each article, we retrieve the total number of Twitter mentions and arXiv downloads 60 days after submission, and the total number of early citations on September 30, 2011 (7 months after submission of the latest paper). Given that each article could have been submitted at any time in a 5-month period, i.e. October 4, 2010 to March 1, 2011, some articles could have had up to 5 more months than others to accumulate early citations by September 30, 2011. The citation counts observed on September 30, 2011 may therefore be biased by the submission date of the article in question. We must consequently include the amount of time that an article has had to accumulate citations since its submission date as an independent variable in our regression models. Let $P$ represent the number of days between the submission time of the article and September 30, 2011. We thus define the following multivariate linear regression models:
\begin{equation} C=\beta_{1}T+\beta_{2}P+\varepsilon \end{equation}
\begin{equation} C=\beta_{1}A+\beta_{2}P+\varepsilon \end{equation}
\begin{equation} C=\beta_{1}T+\beta_{2}A+\beta_{3}P+\varepsilon \end{equation}
where $\beta_{i}$ denotes the corresponding regression coefficient.
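As an illustration only, the three models can be fitted by ordinary least squares along the following lines. This is a minimal sketch rather than the procedure actually used for Table 3: the helper \texttt{fit\_linear\_model} is a hypothetical name, the data are synthetic placeholders drawn from arbitrary distributions (not the 70-article cohort), and the models are fitted without an intercept because the equations are written without one.

```python
import numpy as np


def fit_linear_model(citations, *predictors):
    """Least-squares fit of C = sum_i beta_i * x_i + eps (no intercept term).

    Hypothetical helper for illustration; returns the coefficient vector beta.
    """
    X = np.column_stack(predictors)  # one column per independent variable
    beta, *_ = np.linalg.lstsq(X, citations, rcond=None)
    return beta


# Synthetic placeholder data (assumption: NOT the data analysed in the paper).
rng = np.random.default_rng(0)
n = 70
T = rng.poisson(20, size=n).astype(float)    # Twitter mentions
A = rng.poisson(500, size=n).astype(float)   # arXiv downloads
P = rng.uniform(200.0, 360.0, size=n)        # days available to accumulate citations
C = 0.3 * T + 0.02 * P + rng.normal(0.0, 1.0, size=n)  # synthetic citation counts

beta_TP = fit_linear_model(C, T, P)        # C = b1*T + b2*P + eps
beta_AP = fit_linear_model(C, A, P)        # C = b1*A + b2*P + eps
beta_TAP = fit_linear_model(C, T, A, P)    # C = b1*T + b2*A + b3*P + eps

print("T, P    :", beta_TP)
print("A, P    :", beta_AP)
print("T, A, P :", beta_TAP)
```

The sketch returns point estimates only; assessing the significance of each coefficient, as reported in Table 3, would additionally require standard errors and $p$-values from a statistics package.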
From Table 3, we observe that the publication period $P$ is certainly a non-negligible factor in predicting the citation counts $C$, but also that Twitter mentions $T$ show equally significant correlations. Moreover, Twitter mentions seem to be the most significant predictor of citations, compared to arXiv downloads and time since publication. This is not the case for arXiv downloads which, when accounting for Twitter mentions and time since publication, do not exhibit a statistically significant relationship to early citations. In Figure 7 we show the bivariate scatterplots between Twitter mentions, arXiv downloads and citations. The corresponding Pearson's correlation coefficients are shown as well. Figures 7(b) and 7(c) again show that Twitter mentions are better correlated with citations than arXiv downloads are, which matches the results obtained from our multivariate linear regression analysis. In addition, Twitter mentions are also positively correlated with arXiv downloads, as shown in Figure 7(a), suggesting that the Twitter attention received by an article can be used to estimate its usage data, but usage, in turn, does not seem to be correlated with early citations. Given the rather small sample size and the unequally distributed scatter, we performed a delete-1 observation jackknife on the Pearson's correlation coefficient between Twitter mentions and early citations ($N=70$). This yields a modified correlation value of 0.430, compared with the original value of 0.4516, indicating that the observed correlation is rather robust. However, dropping the two most frequently tweeted articles does reduce the correlation to 0.258 ($p=0.016$), implying that the observed correlation is strongest when articles frequently mentioned on Twitter are included, matching the results reported by \cite{Eysenbach_2011}.
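The robustness checks described above can be sketched as follows. This is a hedged illustration under stated assumptions: the function names \texttt{jackknife\_pearson} and \texttt{drop\_top\_k} are hypothetical, the input data are synthetic, and the text does not state which jackknife statistic underlies the reported value of 0.430, so the sketch returns both the leave-one-out correlations and the standard bias-corrected jackknife estimate.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson's correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def jackknife_pearson(x, y):
    """Delete-1 jackknife of Pearson's r (hypothetical helper).

    Returns the full-sample r, the array of leave-one-out r values, and the
    bias-corrected jackknife estimate n*r - (n-1)*mean(leave-one-out r).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    r_full = pearson_r(x, y)
    keep = ~np.eye(n, dtype=bool)  # row i masks out observation i
    r_loo = np.array([pearson_r(x[m], y[m]) for m in keep])
    r_jack = n * r_full - (n - 1) * r_loo.mean()
    return r_full, r_loo, r_jack


def drop_top_k(x, y, k):
    """Pearson's r after removing the k observations with the largest x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)              # ascending in x
    kept = order[: len(order) - k]     # discard the k largest x values
    return pearson_r(x[kept], y[kept])


# Synthetic placeholder data (assumption: NOT the data analysed in the paper).
rng = np.random.default_rng(1)
mentions = rng.poisson(20, size=70).astype(float)
cites = 0.2 * mentions + rng.normal(0.0, 3.0, size=70)

r_full, r_loo, r_jack = jackknife_pearson(mentions, cites)
print("full-sample r:          ", r_full)
print("mean leave-one-out r:   ", r_loo.mean())
print("bias-corrected jackknife:", r_jack)
print("r without top 2 by x:   ", drop_top_k(mentions, cites, 2))
```

A large gap between the full-sample value and the value obtained after removing the most tweeted articles would indicate, as in the text, that a few highly mentioned articles carry much of the correlation.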