A Bayesian model for quantifying errors in citizen science data:
application to rainfall observations from Nepal
Abstract
High-quality citizen science can be instrumental in advancing science
toward new discoveries and a deeper understanding of under-observed
phenomena. However, the error structure of citizen scientist (CS) data
must be well-defined. Within a citizen science program, the error types
in submitted observations vary, and their occurrence may depend on a
variety of CS-specific variables, such as motivation. This study
develops a graphical Bayesian inference model of error types in CS data.
The model assumes that: (1) each CS observation is subject to a specific
error type, each with its own bias and noise; and (2) an observation’s
error type depends on the error community of the CS, which in turn
relates to characteristics of the CS submitting the observation. Given a
set of CS observations and corresponding ground-truth values, the model
can be calibrated for a specific application, yielding (i) the number of
error types and communities, (ii) the bias and noise of each error type,
(iii) the error distribution of each community, and (iv) the community to
which each CS belongs. The model, applied to Nepal CS rainfall
observations, identifies seven error types and sorts CSs into four
model-inferred communities. In the case study, 79% of CSs committed
errors in fewer than 6.3% of their observations. The remaining CSs tended
to commit unit, meniscus, and unknown errors. A CS’s assigned community,
coupled with the model-inferred error probability, can identify
observations that require verification. With such a system, the onus of
validating CS data is partially transferred from human effort to
machine-learned algorithms.
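
The two model assumptions stated above can be illustrated with a minimal sketch. It is not the calibrated model from the study: the three error types, the two communities, and every numeric value below are hypothetical, and the bias is assumed to be multiplicative with additive Gaussian noise. The sketch simulates observations for a CS in a given community and then applies Bayes' rule to obtain the probability of each error type for a single observation with a known ground-truth value.

```python
import numpy as np

# Hypothetical error types (illustrative values, not those inferred in the study).
# Index 0: no error, index 1: unit error (assumed factor of 10), index 2: unknown.
BIAS = np.array([1.0, 10.0, 1.0])      # assumed multiplicative bias per error type
NOISE_SD = np.array([0.5, 0.5, 5.0])   # assumed Gaussian noise std. dev. per type

# Hypothetical communities: each row is a distribution over the error types.
COMMUNITY_PROBS = np.array([
    [0.95, 0.03, 0.02],   # community 0: rarely commits errors
    [0.60, 0.25, 0.15],   # community 1: more error-prone
])


def simulate(true_values, community, rng):
    """Draw an error type per observation, then apply that type's bias and noise."""
    true_values = np.asarray(true_values, dtype=float)
    types = rng.choice(len(BIAS), size=true_values.size, p=COMMUNITY_PROBS[community])
    observed = BIAS[types] * true_values + rng.normal(0.0, NOISE_SD[types])
    return types, observed


def error_type_posterior(observed, truth, community):
    """Posterior over error types for one observation, via Bayes' rule."""
    mean = BIAS * truth
    # Gaussian likelihood of the observation under each error type.
    likelihood = np.exp(-0.5 * ((observed - mean) / NOISE_SD) ** 2) / (
        NOISE_SD * np.sqrt(2.0 * np.pi)
    )
    unnormalized = COMMUNITY_PROBS[community] * likelihood
    return unnormalized / unnormalized.sum()


rng = np.random.default_rng(0)
types, observed = simulate([10.0, 25.0, 4.0, 12.0], community=1, rng=rng)

# An observation ten times the ground truth is attributed to the unit error type.
posterior = error_type_posterior(observed=100.0, truth=10.0, community=0)
flag_for_verification = posterior[0] < 0.5  # hypothetical threshold on "no error"
```

In this sketch, an observation is flagged for verification when the posterior probability of the "no error" type falls below a threshold; the value 0.5 is an arbitrary choice for illustration.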