Alberto Pepe edited materials.tex  about 11 years ago

Commit id: b1e22b9bab9d4ec52d2675792ac7608cb25d5232


\section{Materials}

\subsection{Data collection}

Our process of determining whether a particular arXiv article was mentioned on Twitter consists of three phases: crawling, filtering, and organization. Tweets are acquired via the Streaming API from the Twitter Gardenhose, which provides a random sample of roughly 10\% of all tweets on the public timeline. We collected tweets with timestamps between 2010-10-01 and 2011-04-30, which results in a sample of 1,959,654,862 tweets.

The goal of the data filtering process is to find all tweets that contain a URL that directly or indirectly links to an arXiv.org paper. However, determining whether a paper has or has not been mentioned on Twitter is fraught with a variety of issues, the most important of which is the prevalence of partial or shortened URLs. Twitter imposes a 140-character limit on the length of tweets, and users therefore employ a variety of methods to replace the original article URLs with alternative or shortened ones. Since many different shortened URLs can point to the same original URL, we resolve all shortened URLs in our Twitter data set to determine whether any of them point to the articles in our arXiv cohort.

We distinguish between four general types of scholarly mentions on Twitter, based on whether they contain:
\begin{enumerate}
\item a URL that directly refers to a paper published on arXiv.org;
\item a shortened URL that, upon expansion, refers to an arXiv.org paper;
\item a URL that links to a web page, e.g.~a blog posting, which itself contains a URL that points to an arXiv.org paper;
\item a shortened URL that links to a type (3) mention after expansion.
\end{enumerate}

In order to detect these four types of Twitter mentions, we first expand all shortened URLs in our crawled public tweets.
We select the 16 most popular URL shortening services, including bit.ly, tinyurl.com, and ow.ly, and expand the shortened URLs in our collection of tweets using their respective APIs. In this way, we resolved 98,377,880 short URLs, most of which were generated by the following URL shorteners: bit.ly (61.3\%), t.co (15.2\%), fb.me (6.5\%), tinyurl.com (6.1\%), and ow.ly (4.4\%). (We acknowledge that this procedure will not identify all Twitter mentions of a given arXiv.org paper, but it will capture most of them.) From the resulting set, we retain all tweets that contain the term `arXiv' and at least one URL. Next, we associate tweets with arXiv papers by extracting the arXiv ID (a substring matching the pattern `dddd.dddd', where each `d' is a digit) of any paper mentioned in those tweets. (Note that for the third and fourth types of Twitter mention, the arXiv paper ID does not appear in the tweet itself, but needs to be extracted from the web page that the tweet links to.)
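
The filtering and ID-extraction steps described above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the URL regular expression, the case-insensitive match on the term `arXiv', and the short list of shortener domains (taken from the five services named in the text) are all assumptions made for illustration. The sketch also assumes that shortened URLs have already been expanded, and it does not fetch linked web pages, so it covers only mentions whose arXiv ID is visible in the text being examined.

```python
import re
from urllib.parse import urlparse

# Assumption: only the five shorteners named in the text, not the full set of 16.
SHORTENER_DOMAINS = {"bit.ly", "t.co", "fb.me", "tinyurl.com", "ow.ly"}

# Assumption: a simple pattern for http(s) URLs; real tweets may need more care.
URL_RE = re.compile(r"https?://\S+")

# 'dddd.dddd' with each d a digit; the lookarounds keep the pattern from
# matching inside a longer run of digits.
ARXIV_ID_RE = re.compile(r"(?<!\d)\d{4}\.\d{4}(?!\d)")


def extract_urls(text):
    """Return all http(s) URLs found in the text, in order of appearance."""
    return URL_RE.findall(text)


def is_short_url(url):
    """Return True if the URL's host is one of the listed shortening services."""
    return urlparse(url).netloc.lower() in SHORTENER_DOMAINS


def is_candidate(text):
    """Keep a tweet if it contains the term 'arXiv' and at least one URL."""
    return "arxiv" in text.lower() and bool(extract_urls(text))


def arxiv_ids(text):
    """Return the distinct 'dddd.dddd' substrings found in the text, sorted."""
    return sorted(set(ARXIV_ID_RE.findall(text)))


if __name__ == "__main__":
    tweet = "New paper on arXiv: http://arxiv.org/abs/1010.1234 via http://bit.ly/abc"
    if is_candidate(tweet):
        print(arxiv_ids(tweet))
        print([u for u in extract_urls(tweet) if is_short_url(u)])
```

For the third and fourth types of mention, the same \texttt{arxiv\_ids} function would be applied to the content of the linked web page rather than to the tweet text.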