Understanding a Dataset: arXiv.org
The arXMLiv project by the KWARC research group at Jacobs University Bremen has been ongoing for almost a decade, dating back to 2006. I was lucky enough to enroll as a bachelor student at Jacobs during that same year, and got personally involved with arXMLiv in 2007.
The goal of arXMLiv (Stamerjohanns 2010) is to transform the sources of \(\approx 1\) million scientific papers from arXiv, starting with the author-friendly syntax of TeX/LaTeX and ending with highly processable, machine-friendly XHTML/HTML5 documents. Over the years we have become partners with the LaTeXML converter, which ambitiously aims at translating any TeX document into the best possible web equivalent.
The big challenge in arXMLiv is the large-scale processing pass, running LaTeXML over each document of the arXiv dataset. The original framework was built by Dr. Heinrich Stamerjohanns (Stamerjohanns 2010) and has been used for most of the duration of the project. A couple of years back I undertook a rebuilding effort that gave rise to the CorTeX framework, attempting to unify the needs of arXMLiv with the domain-specific needs of math-rich NLP (see LLaMaPUn for more information).
In the course of building CorTeX, starting from the position of a junior PhD student with zero experience in distributed processing, several big problems became painfully obvious:
Reusing a general-purpose distributed architecture always required trade-offs that could not fit with the realities of our hardware or administrative restrictions. There was a clear necessity to build our own custom-fit framework, while using as many standard components as possible, to minimize overhead and potential sources of bugs.
arXiv contains extremely irregular TeX data, guaranteeing that LaTeXML will break in all possible ways it can break. The distributed CorTeX workers which executed the translation needed several types of monitoring - timeouts, memory limits, and even external cleanup of runaway/zombie children (usually related to graphics conversion via ImageMagick and GhostScript).
Even when distributing workloads inside our own uni’s compute cluster (I was graciously provided 600 CPUs from HULK on an on-demand basis), there were occasional network failures, especially related to server load.
In the process of building the distribution infrastructure, I have seen each component become the bottleneck: from doing complex joins on 100GB MySQL tables (in order to synthesize error-report summaries from the logs), through running out of server-side CPUs to receive the conversion results (the main CorTeX server has 10 CPUs at the moment), to hitting HDD write speed limitations when saving the returned batch results into the database.
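The worker-side monitoring mentioned in the list above (timeouts, memory limits, and cleanup of runaway children) can be illustrated with a minimal Python sketch. This is not the CorTeX implementation: the function name `run_guarded`, its parameters, and the default limits are assumptions made for illustration, and the approach is POSIX-only because it relies on `setrlimit` and process groups.

```python
import os
import resource
import signal
import subprocess


def run_guarded(cmd, timeout_s=60, max_mem_bytes=2 * 1024**3):
    """Run cmd with a wall-clock timeout and a cap on virtual memory.

    Returns (status, returncode), where status is "ok", "error" or "timeout".
    Hypothetical helper for illustration only; POSIX-only.
    """

    def limit_memory():
        # Runs in the child just before exec: lower the soft address-space
        # limit, never exceeding the hard limit already in force.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = max_mem_bytes
        if hard != resource.RLIM_INFINITY:
            soft = min(max_mem_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    proc = subprocess.Popen(
        cmd,
        preexec_fn=limit_memory,
        start_new_session=True,  # the child becomes leader of its own process group
    )
    timed_out = False
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        timed_out = True
    finally:
        # Kill the whole group, so that grandchildren left behind by the job
        # (e.g. image converters) are cleaned up as well.
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except (ProcessLookupError, PermissionError):
            pass  # nothing left to kill
        proc.wait()

    if timed_out:
        return "timeout", proc.returncode
    return ("ok" if proc.returncode == 0 else "error"), proc.returncode
```

A job that exceeds the memory cap typically fails its own allocation and exits with a non-zero code, so it is reported as "error" here. Distinguishing it from other failures would need extra bookkeeping, which this sketch leaves out.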
My early conclusion was that I should rely on as many standard components and practices as possible, but the little custom code I had to write had the power of adding bottlenecks and instabilities with surprising ease. At the end of the day, the biggest problem remaining at the end of 2014 was the unpredictable size of each arXiv job, which led to frequent RAM overflows in my job queues.
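One way to picture the job-size problem: if batches are formed by job count, a few huge papers can blow up a queue, whereas batching by a byte budget keeps memory use bounded. The sketch below illustrates that idea under stated assumptions. It is not what CorTeX does, and `batch_by_bytes` and the `(job_id, size_bytes)` pairs are hypothetical names.

```python
def batch_by_bytes(jobs, max_batch_bytes):
    """Group (job_id, size_bytes) pairs into batches within a byte budget.

    A single job larger than the budget is yielded alone, so the caller can
    decide how to handle it instead of overflowing a queue unnoticed.
    """
    batch, batch_bytes = [], 0
    for job_id, size in jobs:
        # Flush the current batch if adding this job would exceed the budget.
        if batch and batch_bytes + size > max_batch_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append((job_id, size))
        batch_bytes += size
    if batch:
        yield batch


jobs = [("a", 40), ("b", 50), ("c", 30), ("d", 500), ("e", 10)]
batches = list(batch_by_bytes(jobs, max_batch_bytes=100))
# batches == [[("a", 40), ("b", 50)], [("c", 30)], [("d", 500)], [("e", 10)]]
```

Such a scheme needs the size of each job up front, which is exactly the kind of dataset property worth measuring.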
The purpose of this blog post is to correct an early error of not digging deeper into the properties of the arXiv dataset, and to record some of the relevant parameters for designing an adequate conversion framework.
Let’s get down to it.
While we used to have a private KWARC access channel to arXiv, we are now using the publicly available bulk access channel, which we have come to enjoy a lot. If you want to play around with it yourself, expect to have to download and unpack close to 450GB of data, as the sources include all supplementary data for the papers, such as images and bibliographies. As a general remark, downloading the entire s3://arxiv/src channel from Amazon S3 will cost you just under \(\$50\), up to the May 2015 snapshot.
Unpacking and setting up the data could be a little tricky, especially if you’re only interested in papers with TeX sources. The open-source CorTeX::Import module could give you an idea of how we worked things out.
The arXMLiv copy of arXiv’s TeX sources, covering all of arXiv up to and including May 2015, contains 955,591 papers, just shy of a million.
Digging into the arXMLiv source dataset (i.e. the TeX papers from arXiv), we can count the number of papers in each monthly collection and plot them below. It’s interesting to observe that arXiv is following a rather clear linear growth pattern.
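For readers who want to reproduce the monthly counts, here is a minimal Python sketch. It assumes you have a list of arXiv identifiers, either new-style (such as `1505.01234`) or old-style (such as `hep-th/9901001`), and it reads the month from the YYMM part of each identifier. The helper names `month_of` and `papers_per_month` are hypothetical, and the identifiers in the example are made up.

```python
import re
from collections import Counter

# New-style identifiers, e.g. 1505.01234 or 0704.0001
_NEW_STYLE = re.compile(r"^(\d{2})(\d{2})\.\d{4,5}$")
# Old-style identifiers, e.g. hep-th/9901001 or math.AG/0601001
# (the slash is optional, to also accept forms like hep-th9901001)
_OLD_STYLE = re.compile(r"^[a-z-]+(?:\.[A-Z]{2})?/?(\d{2})(\d{2})\d{3}$")


def month_of(arxiv_id):
    """Return the month as 'YYYY-MM', or None if the id is not recognized."""
    match = _NEW_STYLE.match(arxiv_id) or _OLD_STYLE.match(arxiv_id)
    if match is None:
        return None
    yy, mm = int(match.group(1)), int(match.group(2))
    if not 1 <= mm <= 12:
        return None
    # arXiv started in 1991, so two-digit years from 91 upward are 19xx.
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return f"{year:04d}-{mm:02d}"


def papers_per_month(arxiv_ids):
    """Count recognized identifiers per month, sorted chronologically."""
    counts = Counter()
    for arxiv_id in arxiv_ids:
        month = month_of(arxiv_id)
        if month is not None:
            counts[month] += 1
    return dict(sorted(counts.items()))


example_ids = ["hep-th/9901001", "hep-th/9901002", "math.AG/0601001",
               "0704.0001", "1505.01234", "1505.00001", "not-an-id"]
monthly = papers_per_month(example_ids)
# monthly == {"1999-01": 2, "2006-01": 1, "2007-04": 1, "2015-05": 2}
```

The resulting dictionary can be passed to any plotting tool to check the growth pattern.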