A Bayesian model for quantifying errors in citizen science data:
application to rainfall observations from Nepal
Abstract
High-quality citizen science data can be instrumental in advancing
science toward new discoveries and a deeper understanding of
under-observed phenomena. However, the error structure of citizen
scientist (CS) data must be well-defined. Within a citizen science
program, the errors in submitted observations vary, and their occurrence
may depend on CS-specific characteristics. This study develops a
graphical Bayesian inference model of error types in CS data. The model
assumes that: (1) each CS observation is subject to one error type, and
each error type has its own bias and noise; and (2) an observation’s error
type depends on the error community of the CS, which in turn relates to
characteristics of the CS submitting the observation. Given a set of CS
observations and corresponding ground-truth values, the model can be
calibrated for a specific application, yielding (i) the number of error
types and error communities, (ii) the bias and noise of each error type,
(iii) the error-type distribution of each error community, and (iv) the
error community to which each CS belongs. The model, applied to Nepal CS
rainfall observations, identifies five error types and sorts CSs into
four model-inferred communities. In the case study, 73% of CSs
submitted data with errors in fewer than 5% of their observations. The
remaining CSs submitted data with unit, meniscus, and unknown errors. A
CS’s assigned community, coupled with model-inferred error
probabilities, can identify observations that require verification. With
such a system, the onus of validating CS data is partially transferred
from human effort to machine-learned algorithms.
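
The two model assumptions stated above can be illustrated with a minimal generative sketch. It is not the authors' implementation: the community names, the numeric parameters, the linear form of the bias (a slope and an offset applied to the true value), and the Gaussian noise are all assumptions chosen for illustration, and only three of the error types are shown. The sketch draws an error type from a community's error-type distribution, generates an observation from the true value, and then computes the posterior probability of each error type for an observation with known ground truth, as would be available during calibration.

```python
import math
import random

# Hypothetical error types: (slope, offset, noise_sd) applied to the true value.
# All numbers are illustrative assumptions, not values from the study.
ERROR_TYPES = {
    "none": (1.0, 0.0, 0.5),
    "unit": (0.1, 0.0, 0.5),      # assumed: value recorded in the wrong unit
    "meniscus": (1.0, 2.0, 0.5),  # assumed: constant reading offset
}

# Hypothetical error communities: probability of each error type.
COMMUNITIES = {
    "careful": {"none": 0.97, "unit": 0.02, "meniscus": 0.01},
    "unit_prone": {"none": 0.60, "unit": 0.35, "meniscus": 0.05},
}


def simulate_observation(truth, community, rng):
    """Draw an error type from the community, then a noisy, biased observation."""
    probs = COMMUNITIES[community]
    names = list(probs)
    error_type = rng.choices(names, weights=[probs[n] for n in names])[0]
    slope, offset, sd = ERROR_TYPES[error_type]
    return error_type, rng.gauss(slope * truth + offset, sd)


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - 0.5 * ((x - mean) / sd) ** 2


def error_type_posterior(obs, truth, community):
    """Posterior over error types given an observation and its ground truth."""
    log_w = {}
    for name, (slope, offset, sd) in ERROR_TYPES.items():
        prior = COMMUNITIES[community][name]
        log_w[name] = math.log(prior) + log_normal_pdf(obs, slope * truth + offset, sd)
    # Normalize in log space to avoid underflow for very unlikely error types.
    top = max(log_w.values())
    unnorm = {name: math.exp(v - top) for name, v in log_w.items()}
    total = sum(unnorm.values())
    return {name: u / total for name, u in unnorm.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    print(simulate_observation(20.0, "unit_prone", rng))
    print(error_type_posterior(2.0, 20.0, "unit_prone"))
```

In this sketch, an observation of 2.0 against a true value of 20.0 is attributed almost entirely to the assumed "unit" error type. Calibrating the actual model also means inferring the number of error types and communities and each CS's community membership, which this fixed-parameter sketch does not attempt.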