Markup noise

Another issue to be aware of is the complete lack of markup quality control in arXiv, which inevitably leads to a variety of errors that add "noise" to the dataset. Because LaTeX offers a broader range of structures than plaintext, these errors are also more diverse. Examples are:

  • Regular plaintext typos

  • Inconsistent use of Unicode characters in names (e.g. "Poincare" vs "Poincaré")

  • Mathematical expressions left in text mode, as well as textual expressions typeset as formulas for stylistic convenience (such as 1$^{st}$, or using $\bullet$ to stand in for an \item).

  • Citations and references that are only inconsistently automated (via \ref and \cite) and are sometimes left in as explicit numbers.

  • An old but quite detailed rough list of such issues can be found here, and probably needs to be revived in a separate document.
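
Some of the error classes listed above can be flagged with simple heuristics. The following is a minimal sketch, not a tool used for this dataset: the names NOISE_PATTERNS, find_markup_noise and fold_accents are hypothetical, and the regular expressions are illustrative assumptions that will miss many cases and may produce false positives (for example, a bracketed number that is not a citation).

```python
import re
import unicodedata

# Hypothetical heuristics (assumptions for illustration, not an exhaustive list).
NOISE_PATTERNS = {
    # Ordinal suffix typeset as math, e.g. 1$^{st}$
    "ordinal_in_math": re.compile(r"\d\$\^\{?(?:st|nd|rd|th)\}?\$"),
    # $\bullet$ used as a manual list marker instead of \item
    "bullet_as_item": re.compile(r"\$\\bullet\$"),
    # Hard-coded numeric citation such as [12] or [3, 7]
    "explicit_citation": re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]"),
}


def find_markup_noise(source):
    """Return a dict mapping each heuristic that fires to its matched snippets."""
    hits = {}
    for name, pattern in NOISE_PATTERNS.items():
        found = pattern.findall(source)
        if found:
            hits[name] = found
    return hits


def fold_accents(name):
    """Strip combining marks so accented and unaccented spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Calling find_markup_noise on a raw LaTeX string returns only the heuristics that matched, so an empty dict means none of these particular patterns fired, not that the source is clean. fold_accents can serve as a comparison key when grouping name variants such as "Poincare" and "Poincaré".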