
A Bayesian model for quantifying errors in citizen science data: application to rainfall observations from Nepal
  • Jessica A Eisma,
  • Gerrit Schoups,
  • Jeffrey Davids,
  • Nick Van de Giesen
Jessica A Eisma
Purdue University

Corresponding Author: jessica.eisma@uta.edu

Gerrit Schoups
Delft University of Technology
Jeffrey Davids
California State University Chico
Nick Van de Giesen
Delft University of Technology


High-quality citizen science can be instrumental in advancing science toward new discoveries and a deeper understanding of under-observed phenomena. However, the error structure of citizen scientist (CS) data must be well-defined. Within a citizen science program, the error types in submitted observations vary, and their occurrence may depend on a variety of CS-specific variables, such as motivation. This study develops a graphical Bayesian inference model of error types in CS data. The model assumes that: (1) each CS observation is subject to a specific error type, each with its own bias and noise; and (2) an observation’s error type depends on the error community of the CS, which in turn relates to characteristics of the CS submitting the observation. Given a set of CS observations and corresponding ground-truth values, the model can be calibrated for a specific application, yielding (i) the number of error types and communities, (ii) the bias and noise of each error type, (iii) the error distribution of each community, and (iv) the community to which each CS belongs. The model, applied to Nepal CS rainfall observations, identifies seven error types and sorts CSs into four model-inferred communities. In the case study, 79% of CSs committed errors in fewer than 6.3% of their observations. The remaining CSs tended to commit unit, meniscus, and unknown errors. A CS’s assigned community, coupled with the model-inferred error probability, can identify observations that require verification. With such a system, the onus of validating CS data is partially transferred from human effort to machine-learned algorithms.
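The two model assumptions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the error-type names are borrowed from the abstract, but the linear-Gaussian form of each error type, the two community names, every numeric value, and the helper functions (`simulate_observation`, `error_type_posterior`, `needs_verification`) are hypothetical choices made only for illustration. The sketch draws an error type from a community's distribution, generates an observation with that type's bias and noise, and then applies Bayes' rule to estimate which error type most plausibly produced an observation when a reference value is available.

```python
import math
import random

# Hypothetical error types: obs ~ Normal(slope * truth + offset, sd).
# All values are illustrative assumptions, not results from the study.
ERROR_TYPES = {
    "none":     {"slope": 1.0, "offset": 0.0, "sd": 0.5},
    "unit":     {"slope": 0.1, "offset": 0.0, "sd": 1.0},  # assumed unit mix-up
    "meniscus": {"slope": 1.0, "offset": 2.0, "sd": 0.5},  # assumed reading offset
}

# Hypothetical communities: each is a probability distribution over error types.
COMMUNITIES = {
    "careful":     {"none": 0.95, "unit": 0.02, "meniscus": 0.03},
    "error_prone": {"none": 0.60, "unit": 0.25, "meniscus": 0.15},
}


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution, used to avoid underflow."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd * math.sqrt(2.0 * math.pi))


def simulate_observation(truth, community, rng):
    """Draw an error type from the community, then a noisy, biased observation."""
    probs = COMMUNITIES[community]
    names = list(probs)
    error_type = rng.choices(names, weights=[probs[n] for n in names])[0]
    p = ERROR_TYPES[error_type]
    obs = rng.gauss(p["slope"] * truth + p["offset"], p["sd"])
    return error_type, obs


def error_type_posterior(obs, truth, community):
    """Posterior over error types given an observation and its reference value."""
    log_w = {}
    for name, prior in COMMUNITIES[community].items():
        p = ERROR_TYPES[name]
        mean = p["slope"] * truth + p["offset"]
        log_w[name] = math.log(prior) + log_normal_pdf(obs, mean, p["sd"])
    top = max(log_w.values())  # subtract the maximum before exponentiating
    weights = {name: math.exp(v - top) for name, v in log_w.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def needs_verification(obs, truth, community, threshold=0.5):
    """Flag an observation when the probability of any error exceeds the threshold."""
    posterior = error_type_posterior(obs, truth, community)
    return 1.0 - posterior["none"] > threshold


if __name__ == "__main__":
    rng = random.Random(0)
    error_type, obs = simulate_observation(20.0, "error_prone", rng)
    print(error_type, round(obs, 2))
    print(error_type_posterior(2.0, 20.0, "careful"))
```

In this sketch the community acts as the prior over error types, so the same observation can be flagged for one CS and accepted for another. Calibrating the number of error types and communities, as the study describes, would require full Bayesian inference rather than the fixed values assumed here.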