Tokenizing an arXiv.org article with LLaMaPUn

    Welcome to LLaMaPUn!

    The Cornell-hosted preprint server arXiv.org contains roughly a million scientific papers, making it a treasure trove for natural language processing (NLP) experiments.

    However, a big difference from mainstream NLP corpora is the presence of mathematical formulas, citations, and other language modalities specific to scientific discourse. A second, and in practice just as significant, challenge is that the majority of arXiv documents are authored in LaTeX, which makes them very irregular targets for naive automated mining.

    At the KWARC research group at Jacobs University we have invested a lot of effort into regularizing the arXiv dataset and making it available for NLP research, which is a large topic in its own right. I wrote an entry-level blog post about that effort here.

    In this blog post, I want to briefly introduce the newest incarnation of the LLaMaPUn NLP library for scientific documents, backed by a running example: word tokenization of an average preprint from the arXMLiv dataset.

    Word frequencies in milliseconds

    The repository ships with a word tokenization example, which produces basic word frequency statistics over an (assumed to be average) example paper from the arXMLiv dataset.
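
    Conceptually, the statistics in the overview below boil down to counting token occurrences. As a rough illustration only (this is not llamapun's actual API, and it ignores the math- and citation-aware handling the library applies to the arXMLiv HTML), a minimal Rust sketch over a plain-text rendering of a paper could look like this:

      use std::collections::HashMap;
      use std::env;
      use std::fs;

      fn main() {
          // Hypothetical input: a plain-text rendering of the paper
          // (the real llamapun example works on the arXMLiv HTML instead).
          let path = env::args().nth(1).expect("usage: word_freq <plaintext-file>");
          let text = fs::read_to_string(&path).expect("could not read input file");

          // Naive tokenization: lowercase, split on non-alphabetic characters.
          let mut frequencies: HashMap<String, usize> = HashMap::new();
          for word in text
              .to_lowercase()
              .split(|c: char| !c.is_alphabetic())
              .filter(|w| !w.is_empty())
          {
              *frequencies.entry(word.to_string()).or_insert(0) += 1;
          }

          // Totals analogous to the overview table below.
          let total: usize = frequencies.values().sum();
          println!("Words: {}", total);
          println!("Words (unique): {}", frequencies.len());

          // Raw frequency table, most frequent word first.
          let mut sorted: Vec<(String, usize)> = frequencies.into_iter().collect();
          sorted.sort_by(|a, b| b.1.cmp(&a.1));
          for (word, count) in sorted {
              println!("{}\t{}", word, count);
          }
      }

    The counts of formula and citation “words” in the overview below presumably come from the real tokenizer treating each formula and each citation node as a single placeholder token; a naive plain-text sketch like the one above has no notion of either.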

    Running example

    Here is a basic rundown of the example document, as produced by the tokenization script:

    High-level frequency overview for document 0903.1000:

      Type                Frequency
      Paragraphs          54
      Sentences           137
      Words               1563
      Words (unique)      333
      Formula “words”     296
      Citation “words”    10

    The concrete raw word frequencies are presented in Table \ref{table:rawfrequencies}. We provide two basic visualizations of the frequencies:

    1. In Figure \ref{fig:freqinorder}: the frequency of words, in order of their mention in the document.

    2. In Figure \ref{fig:freqdistribution}: the frequency distribution over words, which, as expected, already resembles the Pareto distribution (see the added note after this list).
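
    An added reading, rather than a fitted claim: a Pareto-like shape here simply means the familiar power law of word frequencies. The frequency $f(r)$ of the $r$-th most frequent word decays roughly as $f(r) \propto r^{-\alpha}$ for some $\alpha > 0$ (Zipf's law); equivalently, the proportion of word types occurring at least $x$ times falls off as $P(F \ge x) \propto x^{-k}$.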

    [Figure \label{fig:freqinorder}: Word frequencies in order of first word occurrence in the source document.]