Deyan Ginev added section_arXMLiv_Base_Stats_While__.tex
almost 9 years ago
Commit id: bb64f67e2ae7f271716afdb9a5d53dff7a9ae38c
diff --git a/section_arXMLiv_Base_Stats_While__.tex b/section_arXMLiv_Base_Stats_While__.tex
new file mode 100644
index 0000000..0c30652
--- /dev/null
+++ b/section_arXMLiv_Base_Stats_While__.tex
...
\section{arXMLiv Base Stats}
While we used to have a private KWARC access channel to arXiv, we now use the publicly available \href{http://arxiv.org/help/bulk_data_s3}{bulk access channel}, which we have come to enjoy a lot. Expect to download and unpack close to \textbf{450GB} of data\footnote{As a general remark, downloading the entire \verb|s3://arxiv/src| channel from Amazon S3 will cost you just under $\$50$.}, as the sources include all supplementary data for the papers, such as images and bibliographies.
Unpacking and setting up the data can be a little tricky, especially if you're only interested in papers with TeX sources. The open-source \verb|CorTeX::Import| module can give you an idea of how we worked things out.
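
To make the ``papers with TeX sources'' filter concrete, here is a minimal Python sketch of one possible check. It is not the logic of \verb|CorTeX::Import|. It assumes that each per-paper payload in the bulk archives is gzip-compressed and holds either a single source file or a tar archive of several files, and it uses a small, heuristic list of TeX markers. The names \verb|TEX_MARKERS|, \verb|looks_like_tex| and \verb|has_tex_source| are hypothetical and chosen only for illustration.

```python
import gzip
import io
import tarfile

# Assumption: these markers are a rough heuristic for (La)TeX content,
# not an exhaustive or authoritative test.
TEX_MARKERS = (b"\\documentclass", b"\\documentstyle", b"\\begin{document}")


def looks_like_tex(data: bytes) -> bool:
    """Return True if the bytes contain one of the common TeX markers."""
    return any(marker in data for marker in TEX_MARKERS)


def has_tex_source(gz_payload: bytes) -> bool:
    """Return True if a gzipped per-paper payload appears to contain TeX.

    Assumption: the payload is gzip-compressed and wraps either a tar
    archive of several files or a single source file.
    """
    raw = gzip.decompress(gz_payload)
    try:
        # "r:" means: read as an uncompressed tar archive only.
        with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as archive:
            for member in archive:
                if not member.isfile():
                    continue
                handle = archive.extractfile(member)
                if handle is not None and looks_like_tex(handle.read()):
                    return True
            return False
    except tarfile.ReadError:
        # Not a tar archive: treat the payload as one source file.
        return looks_like_tex(raw)
```

A check of this kind reads everything in memory, which keeps the sketch short. For a corpus of this size, streaming the archive members and stopping at the first match per paper would likely be the more practical design.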
\subsection{Dataset Size}
The arXMLiv copy of arXiv's TeX sources, covering all of arXiv up to and including May 2015, contains \textbf{955,591} papers with TeX sources, just shy of one million.