Tokenizing an article with LLaMaPUn

Welcome to LLaMaPUn!

arXiv, the preprint server hosted at Cornell, contains roughly a million scientific papers, making it a treasure trove for natural language processing (NLP) experiments.

However, a big difference from mainstream NLP corpora is the presence of mathematical formulas, citations and other language modalities specific to scientific discourse. A second challenge, in practice just as significant, is that the majority of arXiv documents are authored in LaTeX, which makes them very irregular for naive automated mining.

At the KWARC research group at Jacobs University we have invested a lot of work in trying to regularize the arXiv dataset and make it available for NLP research, which is a large topic in its own right. I wrote an entry-level blog post about that effort here.

In this blog post, I want to briefly introduce the newest incarnation of the LLaMaPUn NLP library for scientific documents, backed up by a running example of word tokenization on an average preprint from the arXMLiv dataset.

Word frequencies in milliseconds

The repository ships with a word tokenization example, which produces basic word frequency statistics over an example paper from the arXMLiv dataset that we assume to be average.

Running example

Here is a basic rundown of the example document, as produced by the tokenization script:

High-level frequency overview for document 0903.1000

Type               Frequency
Paragraphs         54
Sentences          137
Words              1563
Words (unique)     333
Formula "words"    296
Citation "words"   10
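To make the rows of this overview concrete, here is a minimal Python sketch of how such counts could be gathered from plain text. It is not the LLaMaPUn tokenizer and does not use its API: the function name `frequency_overview`, the placeholder tokens `MathFormula` and `CitationElement`, and the naive regular-expression splitting are all assumptions made for illustration. The sketch also assumes that formulas and citations were already replaced by those placeholders during preprocessing, and that they are counted separately from ordinary words.

```python
import re
from collections import Counter

# Hypothetical placeholder tokens (an assumption of this sketch): formulas and
# citations are taken to be replaced by these markers before tokenization.
FORMULA = "MathFormula"
CITATION = "CitationElement"


def frequency_overview(text):
    """Return (overview, word_counts) for a plain-text document."""
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [
        s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p) if s
    ]
    # Naive word tokens: maximal runs of ASCII letters.
    tokens = [t for s in sentences for t in re.findall(r"[A-Za-z]+", s)]
    # Placeholders are counted on their own; everything else is lowercased.
    words = [t.lower() for t in tokens if t not in (FORMULA, CITATION)]
    counts = Counter(words)
    overview = {
        "Paragraphs": len(paragraphs),
        "Sentences": len(sentences),
        "Words": len(words),
        "Words (unique)": len(counts),
        'Formula "words"': tokens.count(FORMULA),
        'Citation "words"': tokens.count(CITATION),
    }
    return overview, counts


sample = (
    "We study MathFormula here. It works CitationElement.\n\n"
    "A second paragraph! We study more."
)
overview, counts = frequency_overview(sample)
for row, value in overview.items():
    print(row, value)
```

A real scientific document needs far more care than this (abbreviations, sentence-internal punctuation in formulas, non-ASCII words), which is exactly the kind of irregularity a dedicated library is meant to absorb.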

The concrete raw word frequencies are presented in Table \ref{table:rawfrequencies}. We visualize two basic views over the frequencies:

  1. In Figure \ref{fig:freqinorder}: the frequency of words in order of their mention in the document

  2. In Figure \ref{fig:freqdistribution}: the frequency distribution over words, which as expected already resembles the Pareto distribution.
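The two views can be derived from the same list of word tokens. The following Python sketch shows one way to do so; the function names `frequencies_in_order` and `frequency_distribution` are hypothetical, and the code is an illustration of the idea rather than the script that produced the figures.

```python
from collections import Counter


def frequencies_in_order(words):
    """Pair each distinct word with its total count, in order of first mention."""
    # Counter is a dict subclass, so it keeps keys in first-insertion order.
    return list(Counter(words).items())


def frequency_distribution(words):
    """Return the counts alone, sorted from most to least frequent."""
    return sorted(Counter(words).values(), reverse=True)


words = "the cat saw the dog and the cat".split()
in_order = frequencies_in_order(words)
distribution = frequency_distribution(words)
print(in_order)
print(distribution)
```

Plotting the second list against its rank (1, 2, 3, ...) gives the long-tailed curve the second figure refers to: a few words account for most occurrences, while most words appear only once or twice.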