
Invited Talk: Math-rich Natural Language Processing (NLP) on Billion Token Corpora
Deyan Ginev
Jacobs University Bremen



Abstract

Analyzing mathematical natural language has a high entry barrier, due to challenges of licensing, representation, and processing at scale.

Additionally, the interplay between the modality of mathematical symbolism and natural language often requires redesigning existing state-of-the-art solutions. Examples include core problems in computational linguistics that are largely considered solved for newswire and biomedical texts, such as part-of-speech (POS) tagging and named entity recognition (NER).

A third core challenge is the scarcity of “gold standard” datasets, traditionally used for training and evaluating learned models and analysis techniques. Here again, the family of classic annotation tasks needs to be extended to cover mathematical expressions.

In this talk I will share the experience the KWARC research group has had in working on these problems over the last decade, and suggest potential next steps to ensure open collaborations and reproducible and verifiable results in the domain of math-rich NLP.

Our main corpus of investigation has been Cornell’s e-Print archive, arXiv.org. As of March 2016, its HTML5 conversion contains an estimated 4.2 billion words and over a third of a billion formulas, found in the paragraphs of just under a million scientific articles.