Authorea

Deyan Ginev edited section_arXMLiv_Base_Stats_While__.tex almost 9 years ago

Commit id: 66c0151f42453c9e5d68bbcd2a4dd359d9dd8e00

While we used to have a private KWARC access channel to arXiv, we now use the publicly available \href{http://arxiv.org/help/bulk_data_s3}{bulk access channel}, which we have come to enjoy a lot. If you want to play around with it yourself, expect to download and unpack close to \textbf{450GB} of data\footnote{As a general remark, downloading the entire \verb|s3://arxiv/src| channel from Amazon S3, up to the May 2015 snapshot, will cost you just under $\$50$.}, as the sources include all supplementary data for the papers, such as images and bibliographies. Unpacking and setting up the data can be a little tricky, especially if you're only interested in papers with TeX sources. The open-source \verb|CorTeX::Import| module could give you an idea of how we worked things out.

\subsection{Dataset Size}

The arXMLiv copy of arXiv's TeX sources, covering all of arXiv up to and including May 2015, comprises \textbf{955,591} papers with TeX sources, just shy of a million.
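
To make the unpacking step more concrete, here is a minimal Python sketch of one way to decide whether a single submission from the bulk download carries TeX sources. It is an illustration only and not the logic of \verb|CorTeX::Import|. It assumes that each submission arrives either as a gzipped file (a tar bundle or a single TeX file) or as a PDF-only entry; the function names \verb|looks_like_tex| and \verb|has_tex_source|, the marker list, and the example file name are hypothetical.

```python
import gzip
import io
import tarfile
import zlib

# Heuristic markers (an assumption of this sketch, not an arXiv rule).
TEX_MARKERS = (b"\\documentclass", b"\\documentstyle", b"\\begin{document}")


def looks_like_tex(data: bytes) -> bool:
    """Guess whether raw bytes are TeX by scanning the first 64 KiB for markers."""
    head = data[:65536]
    return any(marker in head for marker in TEX_MARKERS)


def has_tex_source(name: str, payload: bytes) -> bool:
    """Return True if a submission (file name plus raw bytes) seems to contain TeX."""
    if not name.endswith(".gz"):
        return False  # assumed to be a PDF-only (or otherwise non-TeX) entry
    try:
        raw = gzip.decompress(payload)
    except (OSError, EOFError, zlib.error):
        return False  # not valid gzip data
    try:
        # Case 1: the gzip payload is an uncompressed tar bundle of several files.
        with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as bundle:
            for member in bundle:
                if not member.isfile():
                    continue
                if member.name.lower().endswith(".tex"):
                    return True
                extracted = bundle.extractfile(member)
                if extracted is not None and looks_like_tex(extracted.read()):
                    return True
            return False
    except tarfile.ReadError:
        # Case 2: the gzip payload is a single file, possibly a lone TeX source.
        return looks_like_tex(raw)
```

For example, \verb|has_tex_source("1505/1505.00001.gz", payload)| would be called once per entry while iterating over an unpacked archive. A content-based check like this is deliberately crude, so it will misjudge some papers, for instance plain TeX files that use none of the markers.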