MPSLIB: A C++ class for sequential simulation of multiple-point statistical models
Geostatistical simulation methods allow simulation of spatial structures and patterns based on a choice of statistical model. In the last few decades multiple-point statistics (MPS) has been developed that allow inferring the statistical model from a training image. This allows for a simpler quantification of the statistical model, and simulation of more realistic (Earth) structures. A number of different algorithms for MPS based simulation have been proposed, each associated with a unique set of pros or cons. MPSLIB is a C++ class that provides a framework for implementing most of the currently proposed multiple-point simulation methods based on sequential simulation. A number of the most widely used methods are provided as an example. The single normal equation simulation (SNESIM) method is implemented using both a tree and a list structure. A new generalized ENESIM (GENESIM) algorithm is proposed that can act as (in one extreme) the ENESIM algorithm, and (in another extreme) similar to the direct sampling algorithm. MPSLIB aims to be easy to compile on most platforms (standard C++11 is the only requirement) and is released under the Open Source LGPLv3 License to encourage reuse and further development.
\label{sec:introduction} Geostatistics is type of statistics with a focus on describing geo-spatial (Earth) structures in a probabilistic framework through a probability distribution. A ‘geostatistical model’ refer to a selection of a probability distribution to reflect a specific Earth model. Geostatistical models are typically used to characterize subsurface reservoir models used for groundwater, hydrocarbon or heat storage, for both exploration, exploitation, and management. A geostatistical model describes infinitely many single Earth models, consistent with the chosen probability function. The variability of these Earth models reflect the uncertainty related to the choice of statistical model. Such uncertainty quantification is the key advantage of using geostatistical models, as opposed to considering only one, in some sense, optimal model.
Geostatistical modeling is a two-step process. First a statistical model must be selected or described, typically through a probability distribution \(f({\bf m})\). Once a model has been established, actual realizations from the probability distribution is generated using ‘simulation algorithms’, which is the topic of this manuscript.
Traditionally, geostatistical simulation algorithms are based on Gaussian statistics, typically based on statistics only between pairs of model parameters, and hence referred to as 2-point statistics. These methods has been, and are still, widely used (Deutsch 1998). However, 2-point based statistical models are limited with respect to the spatial structures they can represent. With the introduction of multiple-point statistical (MPS) models, more geologically realistic spatial structures can be quantified (Guardiano 1993, Strebelle 2002). This has led to the development of a number of simulation algorithms for multiple-point simulation, e.g. (Guardiano 1993, Strebelle 2002, Straubhaar 2011, Mariethoz 2010). For MPS based statistical models, there is usually no closed form analytical expression of \(f({\bf m})\). Instead, a ‘training image’ or a ‘sample model’ is assumed available from which multiple-point statistical events can be inferred. The goal of MPS methods is to allow sampling from the unknown \(f({\bf m})\) such that realizations are consistent with the statistics that can be inferred from the training image. For an overview of MPS based simulation algorithms see (Mariethoz 2014).
Many of the proposed MPS algorithms are implemented in various forms. However, some of these codes are either not maintained (Remy 2002, Remy 2009), not available on multiple platforms (Remy 2009), not available under an open source license (Ar2Tech 2015, Ephesia Consult 2015, Ephesia Consult 2015a), or does only implement one specific type of MPS algorithm (Strebelle 2015).
Here a C++ library, MPSLIB, designed specifically for multiple-point simulation is presented, that is released under the GPLv3 license. MPSLIB implements the core functionalities needed to implement any multiple-point simulation algorithm, based on sequential simulation (Alabert 1989). Note that this does not include methods based on pattern matching (Wu 2008) and image-synthesis (Mariethoz 2014a). Implementations of the ‘extended normal equations simulation’ (ENESIM) (Guardiano 1993) and the ‘single normal equation simulation’ (SNESIM) (Strebelle 2002, Straubhaar 2011) algorithms are provided as examples. Further, a new algorithm GENESIM, based on the ENESIM algorithm, is proposed that can act similar to both the ENESIM and the direct sampling algorithms (Mariethoz 2010).
\label{sec:description} MPSLIB is written in C++ (Stroustrup 1986) using c++11 (ISO 2011). It consist of a C++ class that provides the framework for applying sequential simulation to sample from multiple-point statistical models. It also contains a number of algorithms implemented using MPSLIB.
At the core of the library is an implementation of sequential simulation, that can be briefly described as follows: Say a probability distribution describes the relation between \(M\) model parameters through \(f({\bf m})\)=\(f(m_1, m_2, ..., m_M)\) that are typically ordered by some nodes on a grid. Then a realization of \(f({\bf m})\) can be generated using sequential simulation as follows:
Visit a random node, say \(m_1\), and generate a realization of \(f({m_1})\) as \(m_1^*\).
Move to another node, say \(m_2\), and generate a realization of \(f({m_2 | m_1^*})\) as \(m_2^*\).
Move to another node, say \(m_3\), and generate a realization of \(f({m_3 | m_1^*, m_2^*})\) as \(m_3^*\).
...
Move to the last node, \(m_M\), and generate a realization of \(f({m_M | m_1^*, m_2^*, ..., m_{M-1}^*})\) as \(m_M^*\).
This will generate a realization of \({\bf m^*} = [m_1^* , m_2^* ,..., m_M^* ,]\) from \(f(\bf m)\). See more details in e.g. (Gomez-Hernandez 1993).
The major challenge implementing the sequential simulation algorithm is to generate a realization from the conditional distribution at each iteration. If the conditional data are given by \({\bf m_c^*}\), then the conditional distribution can in general be given by \[f(m_i | {\bf m}_c^*). \label{eqn:cond}\] If uncertain co-located information about \(m_i\) is available as \(f({m_i}_{soft})\) (often referred to as ‘soft’ data in geostatistical literature) then it is accounted for by drawing from \[f(m_i | {\bf m}_c^*) \ f({m_i}_{soft}). \label{eqn:cond_soft}\]
For MPS based models there will, in general, be no analytical form of \(f(m_i | {\bf m}_c^*)\) available. Instead, the key idea utilized in most simulation algorithms is to infer information about the conditional distribution, by scanning the training image for the conditional data event, \({\bf m}_c^*\). There are (at least) two different approaches for sampling from the conditional distribution in Eqn. \ref{eqn:cond}. In one approach (ENESIM type algorithms), the training image is scanned for matching conditional data events at each iteration of the sequential simulation algorithm, from which the conditional distribution can be estimated (Guardiano 1993), or a realization of the conditional distribution directly obtained (Mariethoz 2010). In another approach, the training image is scanned prior to running the simulation, and the conditional statistics for a number of predefined data templates are stored in memory. The conditional distribution, Eqn. \ref{eqn:cond}, can then be retrieved from memory during simulation (Strebelle 2002, Straubhaar 2011). Usually, storing conditional statistics in memory requires the use of so-called multiple-grids (Strebelle 2002) in order to allow simulation of structures with correlations over longer distances, while at the same time imposing manageable memory and CPU requirements. Using multiple-grids, simulation is performed starting in a coarse grid that is refined until simulation is performed on the finest, and requested, grid size. The use of multiple-grids poses a challenge to conditional simulation (when the value of some model parameters are known before simulation starts). One approach to handle this is to make use of data-relocation. When simulating on coarser grids, conditional data one finer grids are relocated to the closest node in the coarse grid. When conditional simulation has been performed on a coarser grid, re-located data on the coarse grid are removed, and simulation is then performed on a finer grid. For details on multiple-grids and the use of data-relocation see (Hernandez 1993, Strebelle 2002, Straubhaar 2014).
MPS
is a namespace that contains different classes. The main class is the MPSalgorithm
class, which implements the sequential simulation algorithm, methods for reading and writing 3D gridded data, methods for reading known data values (known as hard and soft data), and methods for establishing a data neighborhood, and controlling the simulation path. In addition, MPSAlgorithm
allow multiple grids, including handling of conditional data. Specifically, grid re-location of hard data, as proposed in (Strebelle 2002), is implemented.
Also, two abstract member functions are defined, but not implemented: MPSAlgorithm::readConfig
and MPSAlgorithm::simulate
.
That is, in order to use the MPS class, a new C++ class that inherits the MPSAlgorithm
class needs to be defined, that implements (at least) the two member functions. MPSAlgorithm::readConfig
should set all the parameters needed to run the simulation, for example by reading from a parameter file. MPSAlgorithm::simulate
should implement a method that allows generating a realization from Eqn. \ref{eqn:cond}. As discussed above, the main difference between proposed multiple-point simulation algorithms is related to how this function is implemented, and can be divided into two families of algorithms, those that are similar to ENESIM and those that are similar to SNESIM. Therefore, two generals classes, MPSEnesim
and MPSSnesim
, has been implemented to handle operations specific to ENESIM and SNESIM type simulation.
The ENESIM
subclass extends the main MPSAlgorithm
class with methods and functions specifically designed to use in case the training image is scanned at each iteration. This includes methods for scanning the whole training image in order to estimate the conditional distribution, Eqn. \ref{eqn:cond}, from which a realization can be draw. It also includes methods for scanning only parts of the training image, and optionally only scan the training image until a maximum number of conditional data events has been found.
The SNESIM
subclass extends the main MPS
class with methods and functions specifically designed in case the statistics from the training images is scanned prior to running the simulation algorithm and stored in memory.
The SNESIM
and ENESIM
subclasses form the base of the proposed C++ framework. A key idea is that it should be possible to implement any MPS based sequential simulation algorithm using the C++ Class. Figure \ref{fig:arch} illustrates the architecture of MPSLIB.
To demonstrate the versatility of MPSLIB a number of the most well-known multiple-point simulation algorithms, as well as a new variant, based on sequential simulation has been implemented.
The mps_genesim
algorithm is a generalized implementation of the MPS algorithm named ENESIM proposed by (Guardiano 1993).
In the simplest form of the ENESIM algorithm, the whole training image is scanned to establish the conditional distribution in Eqn. \ref{eqn:cond} at each iteration. This is relatively easy to implement, and is void of problems related to the use of multiple grids. However, it is also computationally extremely demanding, to the point where it is not practical to use. The reason is due to the fact the full training image is scanned at each iteration, in order to obtain a conditional distribution from which a realization can be drawn. In order to alleviate this we suggest to scan the training image only for a maximum number \(N_{max}\) of conditional data events, which means the conditional distribution will only be based on (at the most) \(N_{max}\) counts.
MPSAlgorithm::simulate
is implemented by scanning the training image for a maximum of \(N_{max}\) replicates, from a random starting point. This provide an approximation to the full conditional distribution, from which a realization is generated.
At one extreme, \(N_{max}=\infty\), this algorithm will produce the same results as the classical ENESIM algorithm.
At another extreme, \(N_{max}=1\), this algorithm will provide essentially the same results as the direct sampling algorithm, with the specific difference that using mps_genesim
the conditional distribution is in fact estimated (even if based on only one count), from which a realization is generated. Using direct sampling the conditional distribution is never obtained, and instead a realization from Eqn. \ref{eqn:cond} is taken directly from the training image as the first matching conditional data event(Mariethoz 2010). In this case, when the conditional distribution is based on only one observed conditional event, conditioning to uncertain data, as in Eqn \ref{eqn:cond_soft}, cannot be done.
For \(N_{max}>1\) and \(N_{max}<\infty\) mps_genesim
leads to an algorithm that can approximately account for uncertain data using \ref{eqn:cond_soft}, while at the same avoiding the high CPU requirements associated to scanning the full training image.
The main contribution of SNESIM is that conditional patterns are scanned from the training image prior to simulation, and stored in a search tree (Strebelle 2002). Therefore a member function is implemented that allow scanning a training image for a set of predefined data templates. The corresponding conditional statistics are stored in a binary search tree. MPSAlgorithm::simulate
is implemented such that the conditional distribution in Eqn. \ref{eqn:cond} is obtained by searching through the binary search tree. Then a realization is generated from Eqn. \ref{eqn:cond}.
mps_snesim_list
is very similar to mps_snesim_snesim
. The main difference is that mps_snesim_list
stores conditional statistics using a list rather than a search tree. This lead to less memory requirements, but slower sequential search for the conditional distribution, compared to using a binary search tree (Straubhaar 2011).
The two SNESIM based algorithms will though not produce exactly the same result as they differ slightly in how a missing data event is handled.
The algorithms described above implements the core ideas of the SNESIM, ENESIM, and GENESIM algorithms. Many variations of the methods are available, and many types of post-processing has been presented, some of which may be included in MPSLIB in the future (Remy 2009, Zhang 2012). A natural feature to implement next is to allow simulation of non-stationary fields, by using multiple training images and/or scaling and rotation of on training image, which has not yet been considered (Remy 2009, Mariethoz 2011). The use of parallel computing and GPUs will also a natural development (Mariethoz 2010, Tahmasebi 2012).
\label{sec:examples} As an example, based on the training image in Figure \ref{fig:ti}a, the implemented algorithms will be used to generated realizations in a 2D 80x50 pixel grid, using the and hard and soft data shown in Figure \ref{fig:ti}b.
The implemented algorithms are based on the same core functionality, and therefore the parameter files needed to run the algorithms are very similar. The following text file is part of the parameter file for all simulations algorithms:
Number of realizations # 3
Random Seed (0 `random` seed) # 1
Simulation grid size X # 80
Simulation grid size Y # 50
Simulation grid size Z # 1
Simulation grid world/origin X # 0
Simulation grid world/origin Y # 0
Simulation grid world/origin Z # 0
Simulation grid grid cell size X # 1
Simulation grid grid cell size Y # 1
Simulation grid grid cell size Z # 1
Training image file (spaces not allowed) # ti.dat
Output folder (spaces in name not allowed) # .
Shuffle Simulation Grid path (1 : random, 0 : sequential) # 1
Shuffle Training Image path (1 : random, 0 : sequential) # 1
HardData filename (same size as the simulation grid)# conditional.dat
HardData seach radius (world units) # 1
Softdata categories (separated by ;) # 0;1
Soft datafilenames (separated by ; only need (number_categories - 1) grids) # soft.dat
Number of threads (minimum 1, maximum 8 - depend on your CPU) # 1
Debug mode(2: write to file, 1: show preview, 0: show counters, -1: no ) # -1
The parameter file above defines a 2D regular grid (80 by 50 pixels) where each pixel has a size of 1x1 and the coordinates of the first pixel is (0,0). The training image must be given in an ASCII formatted GSLIB file (a type of simplified Geo-EAS formatted file, see (Deutsch 1998)) called ‘ti.dat’, where the first line contains the dimension of the training image. For example, the first line for a 100x100x10 training image must be ’100 100 10’. One can optionally make use of hard data (data known with no uncertainty), or soft data (data that reflect uncertain observations), both in GSLIB format. The number of threads is currently not used, but kept to allow the use of multi threading in the future. Finally the debug mode defines the amount of information written to disk and screen during simulation.
Both mps_snesim_tree
and mps_snesim_list
make use of the same additional information in the parameter file:
Number of multiple grids # 4
Search template size X # 7
Search template size Y # 7
Search template size Z # 1
Maximum number conditional data (-1: all) # 49
Min Node count (-1: ignore)# 5
The first line sets the number of multiple grids used. To disable the use of multiple-grids, “0” should be chosen. The next three lines indicate the search template size, which is here set as 7x7x1. This means that any conditional data event within a 7x7 grid size is scanned, and its associated conditional statistics stored, prior to running the simulation algorithm. The next line sets the maximum number of conditional data within the template to consider. If set to zero all conditional data within the template will be used.
The last line sets the minimum number of replicates needed in order to use an obtained conditional distribution. If the number of conditional data is below this number then one conditional data is discarded, until enough replicates are found.
The mps_genesim
algorithm needs the following additional parameters defined:
Max number of conditional point, N_cond # 49
Maximum number of counts for conditional pdf, N_max # 1
Max number of iterations, N_max_ite # 10000
The first line define the maximum number of conditional points needed during simulation. This is related to the template size for the SNESIM type algorithms. However, as no multiple-grids are used, conditioning can be effective over ranges spanning the whole simulation grid. The next two lines define properties specifically related to the GENESIM algorithm. Line 2 defines the maximum number of points, \(N_{max}\), used to setup the conditional distribution Eqn. \ref{eqn:cond}. However, even if not enough conditional points have been obtained after \(N_{max\_ite}=10000\) iterations of scanning for a match (as define in the last line), scanning is stopped and the conditional distribution used as is.
In case \(N_{max}=\infty\) and \(N_{max\_ite}=\infty\), then mps_genesim
will behave just as the ENESIM algorithm. \(N_{max}=1\) and \(N_{max\_ite}=\infty\) will lead to an algorithm similar direct sampling, in the sense that if needed, the whole training image will be scanned for a replicate. The original implementation of direct sampling does not allow for conditioning to ‘soft’ uncertain data in a manner similar to the ENESIM and SNESIM algorithm. However, the GENESIM algorithm, using a finite \(N_{max\_ite}\) has the potential to decrease the simulation time significantly compared to a traditional ENESIM implementation, while at the same time allowing for conditioning to uncertain data through Eqn. \ref{eqn:cond_soft}.
Figure \ref{fig:reals}, shows three independent realizations conditional to only the hard data, generated using mps_snesim_tree
, mps_snesim_list
, and mps_snesim_genesim
with \(N_{max}=\infty\), \(N_{max}=10\), and \(N_{max}=1\). Figure \ref{fig:reals_soft}, shows the same results as Figure \ref{fig:reals} but in case conditioning to both hard and soft data. Note how soft data is not taken into account, as discussed previously, when using GENESIM in style of direct sampling with \(N_{max}=1\), Figure \ref{fig:reals_soft}e. On the other hand, using \(N_{max}=10\), the GENESIM algorithm can take soft data into account, Figure \ref{fig:reals_soft}d.
\label{sec:impact} MPSLIB has been developed to be useful to both beginners and experts using multiple-point geostatistics. It should be useful for both scientific research and commercial application. A goal has been to provide a foundation that allows developing, testing, and comparing the many different approaches for utilizing multiple-point simulation, which are all largely built on the same basic idea.
Perhaps the most widely known and used software for geostatistical estimation and simulation is GSLIB (Deutsch 1998). GSLIB, written in Fortran, provides both a set of building blocks, and numerous examples of how to use these building blocks to implement useful algorithms, for 2-point statistical modeling. The history of GSLIB has shown that availability of free and open software can be a main driver in the development of a research field, benefiting both basic research and commercial application. The design of MPSLIB is chosen with GSLIB in mind.
MPSLIB should be flexible, easy to extend, easy to compile, and applicable for both scientific and commercial purposes. MPSLIB should alleviate the possibility to test out new ideas, without the need to re-implement every aspect of sequential simulation. Currently no such computer code is available, that use both a permissive open source license, and is maintained.
The development of the generalized ENESIM algorithm is an example on how the existing code can be extended to produce a new type of multiple-point statistical algorithm with some novel features. We hope this example, and the base code, will encourage other developers to extend the code in a similar manner.
The software was developed in a collaboration between researchers from a public institution and a commercial company. The company will use the code as the back end for a user orientated geological model building experience. The researchers will use the code for scientific research related to multiple-point statistical modeling in the future.
MPSLIB is a C++ library for multiple-point based statistical models based on sequential simulation. It is intended to provide the building blocks that allow implementation of any multiple-point simulation method based on sequential simulation. As examples, variants of ENESIM and SNESIM type multiple-point algorithms have been implemented and a new algorithm proposed. MPSLIB relies only on standard C++11 and should be easy to compile, modify and extend, and is released under the LGPLv3 license.
The code needed to generate Figures \ref{fig:ti}-\ref{fig:reals_soft} is part of the source code, as well as a Matlab interface that allow running the developed algorithms in parallel using the Matlab Parallel Computing toolbox. The development of MPSLIB has been developed as part of the ERGO project, funded by a grant from the Danish High Technology Foundation. We thank Giulio Vignoli, GEUS, Denmark, for testing and feedback.
Nr. | Code metadata description | |
---|---|---|
C1 | Current code version | 1.1 |
C2 | Permanent link to code/repository | https://github.com/ElsevierSoftwareX/SOFTX-D-16-00042 |
C3 | Legal Code License | LGPLv3 |
C4 | Code versioning system used | git |
C5 | Software code languages, tools, | C++ |
C6 | Compilation requirements | C++11 |
C7 | Link to developer documentation/manual | https://github.com/ElsevierSoftwareX/SOFTX-D-16-00042/blob/master/README.md |
C8 | Support email for questions | ltv@i-gis.dk, tmeha@nbi.ku.dk |
F Alabert, others. Non-Gaussian data expansion in the earth sciences. Terra Nova 1, 123–134 Wiley Online Library, 1989.
Ar2Tech. Ar2Gems. (2015). Link
C. V. Deutsch, A. G. Journel. GSLIB, Geostatistical Software Library and User’s Guide. 384 Oxford University Press, 1998.
Ephesia Consult. Impala. (2015). Link
Ephesia Consult. DeeSee. (2015). Link
J.J. Gomez-Hernandez, A.G. Journel. Joint sequential simulation of multi-Gaussian fields. Geostatistics Troia 92, 85–94 (1993).
F. Guardiano, R.M. Srivastava. Multivariate geostatistics: beyond bivariate moments. Geostatistics-Troia 1, 133–144 (1993).
JJ Gomez Hernandez. A stochastic approach to the simulation of block conductivity fields conditioned upon data measured at a smaller scale. (1993).
ISO. Information technology – Programming languages – C++. (2011).
Gregoire Mariethoz, Philippe Renard, Julien Straubhaar. The Direct Sampling method to perform multiple-point geostatistical simulations. Water Resources Research 46 Wiley Online Library, 2010.
Gregoire Mariethoz, Jef Caers. Multiple-point Geostatistics: Stochastic Modeling with Training Images. John Wiley & Sons, 2014.
Gregoire Mariethoz, Sylvain Lefebvre. Bridges between multiple-point geostatistics and texture synthesis: Review and guidelines for future research. Computers & Geosciences 66, 66–80 Elsevier, 2014.
Gregoire Mariethoz, Bryce FJ Kelly. Modeling complex geological structures with elementary training images and transform-invariant distances. Water Resources Research 47 Wiley Online Library, 2011.
Grégoire Mariethoz. A general parallelization strategy for random path based geostatistical simulation methods. Computers & Geosciences 36, 953–958 Elsevier, 2010.
Nicolas Remy, Arben Shtuka, Bruno Levy, Jef Caers. G S TL: the geostatistical template library in C++. Computers & geosciences 28, 971–979 Elsevier, 2002.
Nicolas Remy, Alexandre Boucher, Jianbing Wu. Applied geostatistics with SGeMS: a user’s guide. Cambridge University Press, 2009.
Julien Straubhaar, Philippe Renard, Grégoire Mariethoz, Roland Froidevaux, Olivier Besson. An improved parallel multiple-point algorithm using a list approach. Mathematical Geosciences 43, 305–328 Springer, 2011.
Julien Straubhaar, Duccio Malinverni. Addressing conditioning data in multiple-point statistics simulation algorithms based on a multiple grid approach. Mathematical Geosciences 46, 187–204 Springer, 2014.
S. Strebelle. Conditional simulation of complex geological structures using multiple-point statistics. Math. Geol 34, 1-20 (2002).
S. Strebelle. SNESIM. GitHub repository GitHub, 2015. Link
Bjarne Stroustrup. The C++ programming language. Pearson Education India, 1986.
Pejman Tahmasebi, Muhammad Sahimi, GréGoire Mariethoz, Ardeshir Hezarkhani. Accelerating geostatistical simulations using graphics processing units (GPU). Computers & Geosciences 46, 51–59 Elsevier, 2012.
Jianbing Wu, Alexandre Boucher, Tuanfeng Zhang. A SGeMS code for pattern simulation of continuous and categorical variables: FILTERSIM. Computers & Geosciences 34, 1863–1876 Elsevier, 2008.
Tuanfeng Zhang, Stein Inge Pedersen, Christen Knudby, David McCormick. Memory-efficient categorical multi-point statistics algorithms based on compact search trees. Mathematical Geosciences 44, 863–879 Springer, 2012.