Deyan Ginev edited section_arXMLiv_Base_Stats_While__.tex
almost 9 years ago
Commit id: 66c0151f42453c9e5d68bbcd2a4dd359d9dd8e00
diff --git a/section_arXMLiv_Base_Stats_While__.tex b/section_arXMLiv_Base_Stats_While__.tex
index 57af4b8..b2a1fe9 100644
--- a/section_arXMLiv_Base_Stats_While__.tex
+++ b/section_arXMLiv_Base_Stats_While__.tex
...
While we used to have a private KWARC access channel to arXiv, we are now using the publicly available \href{http://arxiv.org/help/bulk_data_s3}{bulk access channel}, which we have come to enjoy a lot. If you want to play around with it yourself, expect to download and unpack close to \textbf{450GB} of data\footnote{As a general remark, downloading the entire \verb|s3://arxiv/src| channel from Amazon S3, up to the May 2015 snapshot, will cost you just under $\$50$.}, as the sources include all supplementary data for the papers, such as images and bibliographies.
Unpacking and setting up the data can be a little tricky, especially if you're only interested in papers with TeX sources. The open-source \verb|CorTeX::Import| module could give you an idea of how we worked things out.
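To make the "TeX sources only" filtering step concrete, here is a minimal Python sketch. It is not how \verb|CorTeX::Import| works; it only illustrates one possible heuristic. The names \verb|TEX_MARKERS|, \verb|looks_like_tex| and \verb|tex_members| are hypothetical, and the assumption that a paper's sources arrive as a tar archive whose TeX files contain one of a few well-known markers is ours.

```python
import io
import tarfile
from typing import List

# Assumption: a file counts as "TeX" if it contains one of these markers.
# Real-world detection would need to be considerably more careful.
TEX_MARKERS = (b"\\documentclass", b"\\documentstyle", b"\\begin{document}")


def looks_like_tex(data: bytes) -> bool:
    """Return True if the byte string contains a common TeX/LaTeX marker."""
    return any(marker in data for marker in TEX_MARKERS)


def tex_members(fileobj) -> List[str]:
    """List the names of regular files in a tar archive that look like TeX.

    `fileobj` is a seekable binary file object holding a (possibly
    compressed) tar archive.
    """
    names = []
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            extracted = archive.extractfile(member)
            if extracted is not None and looks_like_tex(extracted.read()):
                names.append(member.name)
    return names
```

A paper whose archive yields an empty list from \verb|tex_members| would then be skipped as having no TeX sources under this heuristic.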
\subsection{Dataset Size}
The arXMLiv copy of arXiv's TeX sources covers all of arXiv up to and including May 2015 and holds \textbf{955,591} papers with TeX sources, just shy of a million.