Wen-Ping Tsai

and 4 more

Some machine learning (ML) methods, such as classification trees, are useful tools for generating hypotheses about how hydrologic systems function. However, data limitations mean that ML alone often cannot differentiate between causal and associative relationships. For example, a previous ML analysis suggested that soil thickness is the key physiographic factor determining storage-streamflow correlations in the eastern US. That conclusion is not robust, especially when the data are perturbed, and alternative, competing explanations exist, including soil texture and terrain slope. Typical causal analysis based on process-based models (PBMs), on the other hand, is inefficient and susceptible to human bias. Here we demonstrate a more efficient and objective analysis procedure in which ML is first applied to generate data-consistent hypotheses, and a PBM is then invoked to verify them. We employed a surface-subsurface processes model and conducted perturbation experiments to implement the competing hypotheses and assess the impacts of the changes. The experimental results strongly support the soil thickness hypothesis over the terrain slope and soil texture hypotheses; the latter two are co-varying, coincidental factors. Thicker soil permits larger saturation excess and longer system memory, which carries wet-season water storage forward to influence dry-season baseflows. We further suggest that this analysis could be formalized into a novel, data-centric Bayesian framework. This study demonstrates that PBMs offer indispensable value for problems that ML cannot solve alone, and it is meant to encourage more synergies between ML and PBMs in the future.

Kai Ma

and 7 more

Available global streamflow gauge and catchment property data are drastically imbalanced geographically, and data characteristics vary widely, so that models calibrated in one region cannot normally be migrated to another. Currently, in data-sparse regions, non-transferable machine learning models are habitually trained on small local datasets. Here we show that transfer learning (TL), in the sense of weight initialization and weight freezing, allows long short-term memory (LSTM) streamflow models trained over the Conterminous United States (CONUS, the source dataset) to be transferred to catchments on other continents (the target regions) without the need for extensive catchment attributes. We demonstrate this possibility for regions where data are dense (664 basins in the UK), moderately dense (49 basins in central Chile), and scarce, with only globally available attributes (5 basins in China). In both China and Chile, the TL models significantly improved performance compared to locally trained models. The benefits of TL increased with the amount of data available in the source dataset, but even 50-100 basins from the CONUS dataset provided significant value for TL. The benefits of TL were greater than those of pre-training the LSTM on the outputs of an uncalibrated hydrologic model. These results suggest that hydrologic data around the world have commonalities that deep learning can leverage, and that a simple modification of the currently predominant workflows can yield significant synergies, greatly expanding the reach of existing big data. Finally, this work diversifies existing global streamflow benchmarks.
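
The two TL ingredients named above, initializing a target model from source-trained weights and freezing some of those weights, can be illustrated with a deliberately tiny sketch. This is not the authors' LSTM setup: it assumes a one-variable linear model `y = w*x + b`, synthetic data, and hypothetical names (`fit`, `w_src`, `b_tl`, and so on) chosen only for illustration. The slope stands in for behaviour shared between source and target, and the offset for the region-specific part.

```python
import random


def fit(xs, ys, w, b, train_w=True, train_b=True, lr=0.05, epochs=500):
    """Gradient descent on mean squared error for y ~ w*x + b.

    A parameter whose train_* flag is False is frozen: its gradient is
    computed but never applied, so it keeps its initial value.
    """
    n = len(xs)
    for _ in range(epochs):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errs, xs)) / n
        grad_b = 2.0 * sum(errs) / n
        if train_w:
            w -= lr * grad_w
        if train_b:
            b -= lr * grad_b
    return w, b


rng = random.Random(0)

# Assumed "source" region: plenty of data, true relation y = 2x + 1.
src_x = [rng.uniform(-1.0, 1.0) for _ in range(500)]
src_y = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in src_x]

# Assumed "target" region: only three noisy samples, same slope, new offset.
tgt_x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
tgt_y = [2.0 * x + 3.0 + rng.gauss(0.0, 0.3) for x in tgt_x]

# Step 1: train on the source dataset from scratch.
w_src, b_src = fit(src_x, src_y, w=0.0, b=0.0)

# Step 2 (weight initialization): start the target model from source weights.
# Step 3 (weight freezing): keep the shared slope fixed, tune only the offset.
w_tl, b_tl = fit(tgt_x, tgt_y, w=w_src, b=b_src, train_w=False)

# Baseline for comparison: a locally trained model that sees only target data.
w_local, b_local = fit(tgt_x, tgt_y, w=0.0, b=0.0)

print(f"source model:   w={w_src:.3f}, b={b_src:.3f}")
print(f"transfer model: w={w_tl:.3f}, b={b_tl:.3f}")
print(f"local model:    w={w_local:.3f}, b={b_local:.3f}")
```

In a deep learning framework the same pattern would typically mean loading the source model's saved parameters into the target model and excluding the frozen layers from the optimizer; which layers to freeze is a design choice that this sketch does not address.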

Wei Zhi

and 6 more

Dissolved oxygen (DO) sustains aquatic life and is an essential water quality measure. Our capability to forecast DO levels, however, remains limited. Unlike the increasingly intensive earth surface and hydroclimatic data, water quality data often have large temporal gaps and sparse areal coverage. Here we ask: can a Long Short-Term Memory (LSTM) deep learning model learn the spatio-temporal dynamics of stream DO from intensive hydroclimatic and sparse DO observations at the continental scale? That is, can the model harvest the power of big hydroclimatic data and use it for water quality forecasting? We used data from CAMELS-chem, a new dataset that includes sparse DO concentrations from 236 minimally disturbed watersheds. The trained model can generally learn the theory of DO solubility under specific temperature, pressure, and salinity conditions. It captures the bulk variability and seasonality of DO and shows potential for forecasting water quality in ungauged basins without training data. However, it often misses concentration peaks and troughs where the DO level depends on complex biogeochemical processes. Surprisingly, the model does not perform better where data are more intensive. It performs better in basins with low streamflow variation, low DO variability, high runoff ratio (> 0.45), and winter precipitation peaks. This work suggests that more frequent data collection under anticipated DO peak and trough conditions is essential to help overcome the issue of sparse data, an outstanding challenge in the water quality community.