Prediction of Distributed River Sediment Respiration Rates using
Community-Generated Data and Machine Learning
Abstract
River sediment microbial respiration is a key indicator of ecosystem
functioning, and the biogeochemical fluxes across this critical zone link
surface and subsurface waters. As such, there is tremendous interest in
measuring and mapping these respiration rates. Because respiration
observations are expensive and labor intensive, limited data are available to
the community. An open-science, collaborative initiative is collecting
samples for respiration rate analysis and multi-scale metadata; this
evolving data set is being used for making machine learning (ML)
predictions at unsampled sites to help inform continued community
engagement. However, finding an optimal configuration for ML models that
work with this feature-rich data set (i.e., more than 100 possible input
variables) is challenging. Here, we present results from a two-tiered approach
to managing the analysis of this complex data set: 1) a stacked ensemble
of models that automatically optimizes hyperparameters and manages the
training of many models and 2) feature permutation importance to identify
the most important features in the models. The major elements of this
workflow are modular, portable, open, and cloud-based, making this
implementation a potential template for other applications. The models
developed here indicate that sediment organic matter chemistry is one of
the most important features for predicting sediment respiration rate.
Other important, larger-scale features fall into the categories of
climatic, ecological, geological, and fluvial settings. Leveraging these
larger-scale features to generate data-driven estimates of river
sediment respiration rates reveals spatially consistent but
heterogeneous patterns across the river network of the Columbia River
Basin.
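
The second tier of the approach, feature permutation importance, can be illustrated with a minimal sketch. It is not the workflow's actual implementation: the function names (permutation_importance, toy_model), the synthetic data, and the use of mean squared error as the score are all assumptions made for illustration, and toy_model is a hypothetical stand-in for a trained stacked ensemble. The idea is to shuffle one input feature at a time and record how much the prediction error grows; features whose shuffling degrades the predictions most are ranked as most important.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each feature column is shuffled.

    `predict` is any fitted model exposed as a function of one row
    (hypothetical interface chosen for this sketch).
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle feature j across samples, leaving other features intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical trained model: uses features 0 and 1, ignores feature 2."""
    return 3.0 * row[0] + 0.5 * row[1]


# Synthetic data (assumption): three standard-normal features, small noise.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [toy_model(row) + data_rng.gauss(0, 0.1) for row in X]

importances = permutation_importance(toy_model, X, y)
# Feature indices ordered from most to least important.
ranking = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
```

Because the score is computed only from model predictions, the same procedure applies to any model, including a stacked ensemble, which is one reason it suits a workflow that trains many different learners.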