# Thomas Mejer Hansen1, Le Thanh Vu2, Torben Bach3This is the postprint of the following manuscript, to be published in Sowftware X in 2016: Hansen, T.M., Vu, L.T., and Bach, Torben, 2016. MPSLIB: A C++ class for sequential simulation of multiple-point statistical models. SoftwareX. doi:10.1016/j.softx.2016.07.001. Niels Bohr Instiute, University of Copenhagen, Denmark↩ I-GIS, Aarhus, Denmark↩ I-GIS, Aarhus, Denmark↩

Abstract

Geostatistical simulation methods allow simulation of spatial structures and patterns based on a choice of statistical model. In the last few decades multiple-point statistics (MPS) has been developed that allow inferring the statistical model from a training image. This allows for a simpler quantification of the statistical model, and simulation of more realistic (Earth) structures. A number of different algorithms for MPS based simulation have been proposed, each associated with a unique set of pros or cons. MPSLIB is a C++ class that provides a framework for implementing most of the currently proposed multiple-point simulation methods based on sequential simulation. A number of the most widely used methods are provided as an example. The single normal equation simulation (SNESIM) method is implemented using both a tree and a list structure. A new generalized ENESIM (GENESIM) algorithm is proposed that can act as (in one extreme) the ENESIM algorithm, and (in another extreme) similar to the direct sampling algorithm. MPSLIB aims to be easy to compile on most platforms (standard C++11 is the only requirement) and is released under the Open Source LGPLv3 License to encourage reuse and further development.

# Motivation and significance

\label{sec:introduction} Geostatistics is type of statistics with a focus on describing geo-spatial (Earth) structures in a probabilistic framework through a probability distribution. A ‘geostatistical model’ refer to a selection of a probability distribution to reflect a specific Earth model. Geostatistical models are typically used to characterize subsurface reservoir models used for groundwater, hydrocarbon or heat storage, for both exploration, exploitation, and management. A geostatistical model describes infinitely many single Earth models, consistent with the chosen probability function. The variability of these Earth models reflect the uncertainty related to the choice of statistical model. Such uncertainty quantification is the key advantage of using geostatistical models, as opposed to considering only one, in some sense, optimal model.

Geostatistical modeling is a two-step process. First a statistical model must be selected or described, typically through a probability distribution $$f({\bf m})$$. Once a model has been established, actual realizations from the probability distribution is generated using ‘simulation algorithms’, which is the topic of this manuscript.

Traditionally, geostatistical simulation algorithms are based on Gaussian statistics, typically based on statistics only between pairs of model parameters, and hence referred to as 2-point statistics. These methods has been, and are still, widely used (Deutsch 1998). However, 2-point based statistical models are limited with respect to the spatial structures they can represent. With the introduction of multiple-point statistical (MPS) models, more geologically realistic spatial structures can be quantified (Guardiano 1993, Strebelle 2002). This has led to the development of a number of simulation algorithms for multiple-point simulation, e.g. (Guardiano 1993, Strebelle 2002, Straubhaar 2011, Mariethoz 2010). For MPS based statistical models, there is usually no closed form analytical expression of $$f({\bf m})$$. Instead, a ‘training image’ or a ‘sample model’ is assumed available from which multiple-point statistical events can be inferred. The goal of MPS methods is to allow sampling from the unknown $$f({\bf m})$$ such that realizations are consistent with the statistics that can be inferred from the training image. For an overview of MPS based simulation algorithms see (Mariethoz 2014).

Many of the proposed MPS algorithms are implemented in various forms. However, some of these codes are either not maintained (Remy 2002, Remy 2009), not available on multiple platforms (Remy 2009), not available under an open source license (Ar2Tech 2015, Ephesia Consult 2015, Ephesia Consult 2015a), or does only implement one specific type of MPS algorithm (Strebelle 2015).

Here a C++ library, MPSLIB, designed specifically for multiple-point simulation is presented, that is released under the GPLv3 license. MPSLIB implements the core functionalities needed to implement any multiple-point simulation algorithm, based on sequential simulation (Alabert 1989). Note that this does not include methods based on pattern matching (Wu 2008) and image-synthesis (Mariethoz 2014a). Implementations of the ‘extended normal equations simulation’ (ENESIM) (Guardiano 1993) and the ‘single normal equation simulation’ (SNESIM) (Strebelle 2002, Straubhaar 2011) algorithms are provided as examples. Further, a new algorithm GENESIM, based on the ENESIM algorithm, is proposed that can act similar to both the ENESIM and the direct sampling algorithms (Mariethoz 2010).

# Software description

\label{sec:description} MPSLIB is written in C++ (Stroustrup 1986) using c++11 (ISO 2011). It consist of a C++ class that provides the framework for applying sequential simulation to sample from multiple-point statistical models. It also contains a number of algorithms implemented using MPSLIB.

At the core of the library is an implementation of sequential simulation, that can be briefly described as follows: Say a probability distribution describes the relation between $$M$$ model parameters through $$f({\bf m})$$=$$f(m_1, m_2, ..., m_M)$$ that are typically ordered by some nodes on a grid. Then a realization of $$f({\bf m})$$ can be generated using sequential simulation as follows:

1. Visit a random node, say $$m_1$$, and generate a realization of $$f({m_1})$$ as $$m_1^*$$.

2. Move to another node, say $$m_2$$, and generate a realization of $$f({m_2 | m_1^*})$$ as $$m_2^*$$.

3. Move to another node, say $$m_3$$, and generate a realization of $$f({m_3 | m_1^*, m_2^*})$$ as $$m_3^*$$.

4. ...

5. Move to the last node, $$m_M$$, and generate a realization of $$f({m_M | m_1^*, m_2^*, ..., m_{M-1}^*})$$ as $$m_M^*$$.

This will generate a realization of $${\bf m^*} = [m_1^* , m_2^* ,..., m_M^* ,]$$ from $$f(\bf m)$$. See more details in e.g. (Gomez-Hernandez 1993).

The major challenge implementing the sequential simulation algorithm is to generate a realization from the conditional distribution at each iteration. If the conditional data are given by $${\bf m_c^*}$$, then the conditional distribution can in general be given by $f(m_i | {\bf m}_c^*). \label{eqn:cond}$ If uncertain co-located information about $$m_i$$ is available as $$f({m_i}_{soft})$$ (often referred to as ‘soft’ data in geostatistical literature) then it is accounted for by drawing from $f(m_i | {\bf m}_c^*) \ f({m_i}_{soft}). \label{eqn:cond_soft}$

For MPS based models there will, in general, be no analytical form of $$f(m_i | {\bf m}_c^*)$$ available. Instead, the key idea utilized in most simulation algorithms is to infer information about the conditional distribution, by scanning the training image for the conditional data event, $${\bf m}_c^*$$. There are (at least) two different approaches for sampling from the conditional distribution in Eqn. \ref{eqn:cond}. In one approach (ENESIM type algorithms), the training image is scanned for matching conditional data events at each iteration of the sequential simulation algorithm, from which the conditional distribution can be estimated (Guardiano 1993), or a realization of the conditional distribution directly obtained (Mariethoz 2010). In another approach, the training image is scanned prior to running the simulation, and the conditional statistics for a number of predefined data templates are stored in memory. The conditional distribution, Eqn. \ref{eqn:cond}, can then be retrieved from memory during simulation (Strebelle 2002, Straubhaar 2011). Usually, storing conditional statistics in memory requires the use of so-called multiple-grids (Strebelle 2002) in order to allow simulation of structures with correlations over longer distances, while at the same time imposing manageable memory and CPU requirements. Using multiple-grids, simulation is performed starting in a coarse grid that is refined until simulation is performed on the finest, and requested, grid size. The use of multiple-grids poses a challenge to conditional simulation (when the value of some model parameters are known before simulation starts). One approach to handle this is to make use of data-relocation. When simulating on coarser grids, conditional data one finer grids are relocated to the closest node in the coarse grid. When conditional simulation has been performed on a coarser grid, re-located data on the coarse grid are removed, and simulation is then performed on a finer grid. For details on multiple-grids and the use of data-relocation see (Hernandez 1993, Strebelle 2002, Straubhaar 2014).

## Software Architecture

mps_genesim is a namespace that contains different classes. The main class is the MPSAlgorithm::simulate class, which implements the sequential simulation algorithm, methods for reading and writing 3D gridded data, methods for reading known data values (known as hard and soft data), and methods for establishing a data neighborhood, and controlling the simulation path. In addition, mps_genesim allow multiple grids, including handling of conditional data. Specifically, grid re-location of hard data, as proposed in (Strebelle 2002), is implemented.

Also, two abstract member functions are defined, but not implemented: mps_genesim and MPSAlgorithm::simulate .

That is, in order to use the MPS class, a new C++ class that inherits the mps_snesim_list class needs to be defined, that implements (at least) the two member functions. mps_snesim_snesim should set all the parameters needed to run the simulation, for example by reading from a parameter file. mps_snesim_list should implement a method that allows generating a realization from Eqn. \ref{eqn:cond}. As discussed above, the main difference between proposed multiple-point simulation algorithms is related to how this function is implemented, and can be divided into two families of algorithms, those that are similar to ENESIM and those that are similar to SNESIM. Therefore, two generals classes, and , has been implemented to handle operations specific to ENESIM and SNESIM type simulation.

#### ENESIM

The subclass extends the main class with methods and functions specifically designed to use in case the training image is scanned at each iteration. This includes methods for scanning the whole training image in order to estimate the conditional distribution, Eqn. \ref{eqn:cond}, from which a realization can be draw. It also includes methods for scanning only parts of the training image, and optionally only scan the training image until a maximum number of conditional data events has been found.

#### SNESIM

The subclass extends the main class with methods and functions specifically designed in case the statistics from the training images is scanned prior to running the simulation algorithm and stored in memory.

The and subclasses form the base of the proposed C++ framework. A key idea is that it should be possible to implement any MPS based sequential simulation algorithm using the C++ Class. Figure \ref{fig:arch} illustrates the architecture of MPSLIB.