Another aspect to warn against is the complete lack of markup quality control in arXiv, which inevitably leads to a variety of errors that add "noise" to the dataset. As the range of structures in LaTeX is broader than in plaintext, these errors are also more diverse. Examples are:
Regular plaintext typos
Inconsistent use of Unicode characters in names (e.g. "Poincare" vs "Poincaré")
Mathematical expressions left in text mode, as well as textual expressions added as formulas for stylistic convenience (such as 1$^{st}$ or using $\bullet$ to denote an \item).
Citations and references are only sometimes automated (via \ref and \cite); elsewhere they are left in as explicit numbers.
An old but quite detailed rough list of such issues can be found here, and probably needs to be revived in a separate document.
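A few of the error classes listed above can be flagged with simple heuristics. The following is a minimal sketch, not part of the dataset tooling: the pattern names, `fold_accents`, and `find_markup_noise` are hypothetical, and the regular expressions are assumptions that cover only the exact forms mentioned in the list (ordinals typeset in math mode, `$\bullet$` used as a list marker, bracketed numeric citations). Expect false positives, for example a bracketed number that is not a citation.

```python
import re
import unicodedata

# Hypothetical patterns, one per error class; each is a rough heuristic.
ORDINAL_IN_MATH = re.compile(r"\d+\$\^\{?(?:st|nd|rd|th)\}?\$")   # e.g. 1$^{st}$
BULLET_AS_ITEM = re.compile(r"\$\\bullet\$")                      # $\bullet$ instead of \item
HARDCODED_CITATION = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")     # e.g. [12] or [3, 4]


def fold_accents(name):
    """Strip combining marks so accented and unaccented spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_markup_noise(source):
    """Return the matches found in a LaTeX source string, grouped by error class."""
    return {
        "ordinal_in_math": ORDINAL_IN_MATH.findall(source),
        "bullet_as_item": BULLET_AS_ITEM.findall(source),
        "hardcoded_citation": HARDCODED_CITATION.findall(source),
    }
```

Comparing `fold_accents(a) == fold_accents(b)` treats "Poincaré" and "Poincare" as the same name, which is one possible way to group inconsistent spellings. Accent folding is lossy, so it is better suited to matching than to rewriting the text.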