Authorea

Markup noise

Another aspect to warn against is the complete lack of markup quality control in arXiv, which inevitably leads to a variety of errors that add "noise" to the dataset. Because LaTeX offers a broader range of structures than plain text, these errors are also more diverse. Examples are:

• Regular plaintext typos.
• Inconsistent use of Unicode characters in names (e.g. "Poincare" vs "Poincaré").
• Mathematical expressions left in text mode, as well as textual expressions written as formulas for stylistic convenience (such as 1$^{st}$, or using $\bullet$ to denote an \item).
• Citations and references that are inconsistently automated: sometimes generated via \ref and \cite, sometimes left in as explicit numbers.

An old but quite detailed rough list of such issues can be found here, and probably needs to be revived in a separate document.
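
Some of the noise described in this section can be flagged with simple heuristics. The following is a minimal Python sketch, not a tool used for the dataset: the names `fold_accents`, `NOISE_PATTERNS` and `find_noise` are hypothetical, and the regular expressions are rough assumptions that will miss cases and produce false positives (for instance, an interval such as [0,1] looks like a hardcoded citation).

```python
import re
import unicodedata


def fold_accents(name: str) -> str:
    """Strip combining marks so that accented and unaccented spellings compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical heuristics, one per kind of markup noise (assumptions, not exhaustive).
NOISE_PATTERNS = {
    # Ordinals typeset through math mode, e.g. 1$^{st}$
    "ordinal_in_math": re.compile(r"\d\$\^\{?(?:st|nd|rd|th)\}?\$"),
    # $\bullet$ used as a manual list marker instead of \item
    "bullet_as_item": re.compile(r"\$\\bullet\$"),
    # Explicit numeric citations such as [3] or [3, 5-7] instead of \cite
    "hardcoded_citation": re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]"),
}


def find_noise(latex_source: str) -> dict:
    """Return a mapping from noise label to the matching snippets found in the source."""
    hits = {}
    for label, pattern in NOISE_PATTERNS.items():
        found = [m.group(0) for m in pattern.finditer(latex_source)]
        if found:
            hits[label] = found
    return hits
```

For example, `fold_accents("Poincaré")` returns `"Poincare"`, so both spellings of the name can be grouped under one key, while `find_noise` reports which of the three patterns occur in a given LaTeX string. Plain typos are deliberately left out, since detecting them needs a dictionary or language model rather than a pattern.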