ROUGH DRAFT authorea.com/49262

# Data Choices - Considerations for Uncertainty

Abstract

There is increasing awareness of the importance of considering multiple sources of forcing data for downscaling and impact modelling applications, given the inherent uncertainties in climate projections and local sensitivities to large-scale drivers. Because the number and accessibility of large-scale climate simulations are limited, selecting such forcing data is typically an exercise in random sampling of possible outcomes. This selection, however, has real implications for how the range of potential events and their associated uncertainties is perceived in both scientific and decision-making contexts. We present a simple illustration of this situation, highlighting where any single study inherently sits within our larger, potentially unarticulated, understanding.

# Setting the stage

\label{opening} In the following we hope to motivate a discussion of how patterns of practice contribute to perceived uncertainty for decision makers. While any decision-making process involves uncertainty, in the context of planning with respect to climate impacts a chief concern is: “how well do we describe the range of risks we must consider responding to?” Identifying, quantifying, and reducing these uncertainties is a topic of much ongoing work and discussion; e.g., (Kalognomou 2013), (Goldstein 2013), (Gaganis 2008), (Stainforth 2007). Much of this discussion, however, relates to the domain of experimental design and model development, i.e., the creation of the simulation ensembles, and so is out of the control of the data user who draws on the results to study potential impacts. Typically an impact modeller cannot access the entirety of the simulations that have been produced, nor do they have the computational resources to run their impact model for all possible forcings in the first place. As such, the selection of forcing data is the main area where the impact modeller has direct input into the representation of forcing uncertainties.

At this scale of implementation, i.e., selecting a few simulations to represent possible inputs for an impact model, there is a tension between a more formal understanding of uncertainty and the desire to impart a sense of what is likely and to communicate where there is and isn’t confidence. Large multi-model and/or calibrated ensembles attempt an [arguably quite limited (Knutti 2010) (Doherty 2010)] expression of such uncertainties. The sub-selection of potential forcings serves rather to acknowledge the indeterminate nature of these external conditions without quantifying their full potential scope. It is widely understood that different simulations represent a continuum of ‘variations on a theme’ (Masson 2011), rather than distinct options where either A, B, or C is the ‘correct’ choice. As such, an intuitive response is to look for a representation of the general expectations expressed by these simulations, such as an ensemble mean. For many applications, however, especially those determined by chronologies of events, this is not a viable approach. In hydrology, the duration and intensity of rainfall are key variables, and these sequences and extremes are lost in the process of model averaging. This leaves little option but to choose and apply what is hopefully a representative sample of simulation realisations. The concern is that this selection makes implicit assertions of confidence, even if it is by necessity undertaken in a haphazard way; e.g., if the selected subset of simulations happens to produce similar output for a given region, this appears at face value to imply that this is a likely outcome.
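The loss of extremes under model averaging can be seen in a toy case. The following is a minimal sketch only: the three daily rainfall series, the simulation names, and the 1 mm/day wet-day threshold are all invented for illustration and are not model output.

```python
# Invented daily rainfall (mm/day) for three hypothetical simulations.
# Each has a single storm, but on a different day; none of this is model output.
members = {
    "sim_a": [0.0, 0.0, 30.0, 0.0, 0.0, 0.0],
    "sim_b": [0.0, 0.0, 0.0, 0.0, 27.0, 0.0],
    "sim_c": [24.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}

WET_DAY_THRESHOLD = 1.0  # mm/day; an assumed definition of a "wet" day


def wet_days(series):
    """Count the days at or above the assumed wet-day threshold."""
    return sum(1 for value in series if value >= WET_DAY_THRESHOLD)


# Day-by-day ensemble mean across the three members.
ensemble_mean = [sum(day) / len(day) for day in zip(*members.values())]

member_peaks = {name: max(series) for name, series in members.items()}
mean_peak = max(ensemble_mean)

print("peak per member:", member_peaks)
print("peak of ensemble mean:", mean_peak)
print("wet days per member:", {n: wet_days(s) for n, s in members.items()})
print("wet days in ensemble mean:", wet_days(ensemble_mean))
```

In this toy case the averaged series peaks at 10 mm/day where the individual members peak at 24 to 30 mm/day, and it has three wet days where each member has one: the mean describes more frequent, weaker rain than any single simulation does.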

These dilemmas are unavoidable under current practice. The resolution needs of hydrological modellers are often addressed using simple bias correction of GCM data, possibly spatially disaggregated, yet all still predicated on the original GCM grid cell data. There is much discussion of model selection in the literature … [insert discussion here, if it actually exists] … but given the inherent computational and scientific limitations in our ability to map the space of all possible climates and event chronologies, even the most critical evaluations are still performed on a sub-sample of a sub-sample. This implies that the considerations addressed here will remain pertinent even as ensemble design and data access continue to improve.

# Outlining the (simplistic) approach of permutations

\label{methods} What sort of landscape is created when varied groups and agencies build sub-ensembles of different sizes, determined by their resources and needs? How much variation is there in the perceived messages of the incorporated forcing data? Here we create a simple illustration (note 1). We take estimates of historical (1986-2005) climatological precipitation from $$\mathrm{n} = 7$$ CMIP5 (Taylor 2012) General Circulation Models (GCMs; note 2), see Table 1, for the grid cells containing Johannesburg, South Africa. We then consider every possible combination of simulations for ensembles of size $$\mathrm{k} = 1$$ to $$7$$. This gives $$\binom{\mathrm{n}}{\mathrm{k}}$$ sub-ensembles for each group size. For each sub-ensemble we then compute the median, as well as the average absolute deviation about the median (note 3). This allows us to visualise how different the ensembles used by various groups may be, and how these simple statistics compare to those of our ‘full’ data collection. One outcome is easily anticipated: the central value and spread of the seven one-member ensembles, taken together, will be the same as those of the single seven-member ensemble. What is not clear a priori, however, is how the spread of sub-ensembles will evolve, and whether there are breakpoints of relevance that can inform selection size. Here we consider only annual and seasonal climatology values. This is a simplification for illustration purposes, as there are many other climatological features of equal or greater significance to hydrological applications.
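As a worked count under this setup, with $$\mathrm{n} = 7$$ the number of sub-ensembles at each size, and in total, is

$$\binom{7}{\mathrm{k}} = 7,\ 21,\ 35,\ 35,\ 21,\ 7,\ 1 \quad (\mathrm{k} = 1, \dots, 7), \qquad \sum_{\mathrm{k}=1}^{7} \binom{7}{\mathrm{k}} = 2^{7} - 1 = 127.$$

For a sub-ensemble with values $$x_1, \dots, x_{\mathrm{k}}$$ and median $$\tilde{x}$$ (symbols introduced here for illustration only), one reading of the spread statistic, assumed here to be the mean of the absolute deviations about the median, is

$$D = \frac{1}{\mathrm{k}} \sum_{i=1}^{\mathrm{k}} \left| x_i - \tilde{x} \right| .$$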

Table 1. CMIP5 models used in this illustration and their modelling groups.

| Model Name | Modelling Group |
| --- | --- |
| CanESM2 | Canadian Centre for Climate Modelling and Analysis |
| CNRM-CM5 | Centre National de Recherches Météorologiques / Centre Européen de Recherche et Formation Avancée en Calcul Scientifique |
| FGOALS-g2 | LASG, Institute of Atmospheric Physics, Chinese Academy of Sciences |
| GFDL-ESM2G | NOAA Geophysical Fluid Dynamics Laboratory |
| MIROC5 | Atmosphere and Ocean Research Institute (The University of Tokyo), National Institute for Environmental Studies, and Japan Agency for Marine-Earth Science and Technology |
| MIROC-ESM | Japan Agency for Marine-Earth Science and Technology, Atmosphere and Ocean Research Institute (The University of Tokyo), and National Institute for Environmental Studies |
| MRI-CGCM3 | Meteorological Research Institute |
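The permutation procedure described in the methods paragraph can be sketched as follows. This is a minimal sketch, not the code used for the analysis: the model names and precipitation values are placeholders invented for illustration (they are not the CMIP5 values), and the helper names `aad_about_median` and `sub_ensemble_stats` are hypothetical.

```python
from itertools import combinations
from statistics import mean, median

# Placeholder annual precipitation climatologies (mm/year) for seven
# hypothetical models. Invented for illustration; NOT the CMIP5 values.
precip = {
    "model_1": 540.0,
    "model_2": 610.0,
    "model_3": 655.0,
    "model_4": 700.0,
    "model_5": 720.0,
    "model_6": 790.0,
    "model_7": 860.0,
}


def aad_about_median(values):
    """Mean of the absolute deviations of the values about their median."""
    centre = median(values)
    return mean([abs(v - centre) for v in values])


def sub_ensemble_stats(data):
    """Map each size k to a list of (median, spread), one per sub-ensemble."""
    stats = {}
    for k in range(1, len(data) + 1):
        stats[k] = []
        # Every possible choice of k models out of the n available.
        for chosen in combinations(sorted(data), k):
            values = [data[name] for name in chosen]
            stats[k].append((median(values), aad_about_median(values)))
    return stats


stats = sub_ensemble_stats(precip)
for k, pairs in stats.items():
    medians = [m for m, _ in pairs]
    spreads = [s for _, s in pairs]
    print(
        f"k={k}: {len(pairs)} sub-ensembles, "
        f"median range {min(medians):.0f}-{max(medians):.0f}, "
        f"spread range {min(spreads):.1f}-{max(spreads):.1f}"
    )
```

Exhaustive enumeration is practical here because seven models give only 127 sub-ensembles in total; for much larger collections the number of combinations grows quickly and sampling the combinations may be a more realistic choice.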

1. The situation will of course vary greatly with location, meteorological variables, time/spatial scales, and other factors.

2. That the presented ‘meta’ analysis is done on a very reduced subset of the available models both allows a clearer presentation and emphasises the lack of absolutes/ground-truths in these sorts of investigations.

3. This estimate of spread is important as impact modellers need output from specific models