Hidden Stories: Topic Modeling in Hydrology Literature
  • Mashrekur Rahman,
  • Grey Nearing,
  • Jonathan Frame
Mashrekur Rahman
University of Alabama

Corresponding Author: [email protected]

Grey Nearing
NASA Goddard Space Flight Center
Jonathan Frame
University of Alabama


Recent advances in computational linguistics and machine learning, including a variety of toolboxes for Natural Language Processing (NLP), facilitate the analysis of vast electronic corpora for a multitude of objectives. Research papers published as electronic text files in different journals offer windows into trending topics and developments, and NLP allows us to extract information and insight about these trends. This project applies Latent Dirichlet Allocation (LDA) topic modeling to bibliometric analyses of all abstracts in selected high-impact (Impact Factor > 0.9) journals in hydrology. Topic modeling uses statistical algorithms to extract semantic information from a collection of texts and is an emerging quantitative method for assessing large bodies of text. The generated topics are interpretable based on our prior knowledge of hydrology and related sub-disciplines. We performed comparative topic-trend, term-level, and document-level cluster analyses across different time periods. These analyses revealed that topics such as climate change research have gained popularity in hydrology over the last decade. An inter-topic correlation analysis also revealed the nature of information exchange and absorption between various communities within the hydrology domain. The primary objective of this work is to allow researchers to explore new branches and connections in the hydrology literature and to facilitate comprehensive and inclusive literature reviews. We aim to combine these results with probability distributions over topics, journals, and authors to create an ontology that scientists and environmental consultants can use to explore relevant literature based on topics and topic relationships.
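To make the idea of LDA topic modeling on abstracts concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' pipeline: the abstract does not say which LDA implementation or inference method was used, so collapsed Gibbs sampling is an assumption here, as are the function name `fit_lda`, the hyperparameters `alpha` and `beta`, the number of topics, and the four toy "abstracts", which are invented purely for illustration.

```python
import random


def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (doc_topic, topic_word, vocab): per-document topic proportions,
    per-topic word probabilities, and the sorted vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables: document-topic, topic-word, and tokens per topic.
    n_dk = [[0] * num_topics for _ in docs]
    n_kw = [[0] * vocab_size for _ in range(num_topics)]
    n_k = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Conditional probability of each topic for this token
                # (unnormalized; rng.choices normalizes the weights).
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][v] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    # Smoothed point estimates from the final sample.
    doc_topic = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(n_kw[t][v] + beta) / (n_k[t] + vocab_size * beta)
         for v in range(vocab_size)]
        for t in range(num_topics)
    ]
    return doc_topic, topic_word, vocab


# Toy "abstracts" (invented for illustration, not taken from the study).
abstracts = [
    "climate change warming precipitation trend climate",
    "groundwater aquifer recharge well groundwater pumping",
    "climate warming drought precipitation change",
    "aquifer groundwater recharge pumping well",
]
docs = [text.split() for text in abstracts]
doc_topic, topic_word, vocab = fit_lda(docs, num_topics=2)

# Print the highest-probability words for each topic.
for t, row in enumerate(topic_word):
    top_words = [w for _, w in sorted(zip(row, vocab), reverse=True)[:4]]
    print(f"topic {t}: {top_words}")
```

Each row of `doc_topic` is a probability distribution over topics for one abstract. In a sketch like this, averaging those rows by publication year would give a simple topic trend, and correlating the columns would give a rough inter-topic correlation; whether the study computed them this way is not stated in the abstract.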