
Deyan Ginev added section_arXMLiv_Base_Stats_While__.tex almost 9 years ago

Commit id: bb64f67e2ae7f271716afdb9a5d53dff7a9ae38c


\section{arXMLiv Base Stats}
While we used to have a private KWARC access channel to arXiv, we now use the publicly available \href{http://arxiv.org/help/bulk_data_s3}{bulk access channel}, which we have come to enjoy a lot. Expect to download and unpack close to \textbf{450GB} of data\footnote{As a general remark, downloading the entire \texttt{s3://arxiv/src} channel from Amazon S3 will cost you just under \$50.}, as the sources include all supplementary data for the papers, such as images and bibliographies. Unpacking and setting up the data can be a little tricky, especially if you are only interested in papers with TeX sources. The open-source \verb|CorTeX::Import| module can give you an idea of how we worked things out.

\subsection{Dataset Size}
The arXMLiv copy of arXiv's TeX sources covers all of arXiv up to and including May 2015 and contains \textbf{955,591} papers with TeX sources, just shy of a million.
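
To make the "TeX sources only" filtering step mentioned earlier more concrete, here is a minimal Python sketch. It is not how \verb|CorTeX::Import| does it. It assumes, without confirmation from this section, that each per-paper payload in the bulk archives is either a PDF, a gzipped tar bundle, or a gzipped single file. The function name \verb|classify_source|, the category labels and the TeX marker list are hypothetical and chosen only for illustration.

```python
import gzip
import io
import tarfile
import zlib

# Hypothetical markers used to guess that a single file is TeX (an assumption).
TEX_MARKERS = (b"\\documentclass", b"\\documentstyle", b"\\begin{document}")


def classify_source(raw: bytes) -> str:
    """Guess the kind of one per-paper payload.

    Returns 'pdf', 'tex-bundle', 'tex-single' or 'other'.
    """
    # Transparently undo one layer of gzip compression (magic bytes 1f 8b).
    if raw.startswith(b"\x1f\x8b"):
        try:
            raw = gzip.decompress(raw)
        except (OSError, EOFError, zlib.error):
            return "other"

    if raw.startswith(b"%PDF"):
        return "pdf"

    # Try to read the payload as an uncompressed tar archive.
    try:
        with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tar:
            names = [m.name.lower() for m in tar.getmembers() if m.isfile()]
    except tarfile.TarError:
        names = None

    if names is not None:
        has_tex = any(n.endswith((".tex", ".ltx")) for n in names)
        return "tex-bundle" if has_tex else "other"

    # Not a tar archive: treat it as a single file and look for TeX markers.
    if any(marker in raw for marker in TEX_MARKERS):
        return "tex-single"
    return "other"
```

Only payloads classified as \verb|tex-bundle| or \verb|tex-single| would then be kept for further processing. The heuristics are deliberately crude: plain TeX sources without any of the listed markers, for instance, would be missed.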