A Gaussian Process (GP) \[{\rm GP}(\mu(\cdot),{\cal C}(\cdot,\cdot)): {\mathbb R}^n \to {\mathbb R} \ ,\] is a stochastic process over some \(n\)-dimensional space, fully specified by a mean function \(\mu(\cdot)\) and a covariance function \({\cal C}(\cdot,\cdot)\). A stochastic process is a parameterized collection of random variables \(\{x_t\}_{t\in T}\) defined on a probability space \((\Omega, {\cal F}, P)\) and taking values in \(\mathbb{R}^m\). The indexing space \(T\) is usually \([0,\infty)\), but the definition is quite general and extends to subsets of \(\mathbb{R}^n\).

A GP is defined by the fundamental property that any finite marginalization of the process to some set of points \(X=\{x_1,\dots,x_k\}\) will be multivariate-normal (MVN) with mean and covariance given by \(\mu\) and \({\cal C}\). Thus, restricting the process to a single point \(x\) yields \(P(x) \sim {\rm N}(\mu(x),{\cal C}(x,x))\), while for a set of three points

\[\begin{aligned} x_1, x_2, x_3 &\sim {\rm MVN}(\vec{\mu}, K) \ , \\ \vec{\mu} &= (\mu(x_1), \mu(x_2), \mu(x_3))^{\sf T} \ , \\ K_{i,j} &= {\cal C}(x_i, x_j) \ .\end{aligned}\]

A GP is translation-invariant or stationary if

\[\begin{aligned} \mu(s) &= \mu(s+h) \ , \\ {\cal C}(s+h, t+h) &= {\cal C}(s,t) \ ,\end{aligned}\]

for all \(h\).
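As a quick numerical check of this condition, the sketch below uses an assumed squared-exponential covariance (the kernel, length scale, and test points are illustrative choices, not from the text), which depends only on the lag \(s - t\) and is therefore stationary:

```python
import numpy as np

# Hypothetical squared-exponential covariance; it depends only on
# the lag s - t, so it should satisfy the stationarity condition.
def cov(s, t, delta=0.5):
    return np.exp(-0.5 * (s - t) ** 2 / delta ** 2)

s, t, h = 0.3, 1.7, 10.0
# C(s + h, t + h) == C(s, t) for any shift h (a constant mean
# trivially satisfies mu(s) == mu(s + h)).
assert np.isclose(cov(s + h, t + h), cov(s, t))
```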

A GP is a probability distribution for functions with some mean and spatial correlation structure. Suppose that a GP is defined on a 1-D space: \(n=1\). Using the fundamental marginalization property, if we pick some finite set of \(k\) points at which to evaluate the process, the problem becomes one of drawing samples from a \(k\)-variate MVN. Given the mean and covariance functions \(\mu(\cdot), {\cal C}(\cdot,\cdot)\), we can generate a set of samples at these points as follows:

Compute the covariance matrix \({\cal C}\) for the \(k\) points, where \[{\cal C}_{ij} = {\cal C}(x_i, x_j) \ .\]

Compute the Cholesky decomposition \(S\) of the covariance matrix \({\cal C} = S S^{\sf T}\).

The vector \[z = \mu + S u \ ,\] where \(u\) is a vector of \(k\) standard normal samples, i.e., \(u_i \sim N(0,1)\), is the desired sample from the GP.

We directly see that the vector \(z\) has the correct expectation \({\rm E}[z] = \mu\). The covariance of \(z\) is also correct \[{\rm cov}[z] = {\rm E}[(z - \mu) (z - \mu)^{\sf T}] = {\rm E} [Su(Su)^{\sf T}] = S {\rm E}[u u^{\sf T}] S^{\sf T} = S S^{\sf T} = {\cal C} \ .\]
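The three steps above can be sketched directly with NumPy; the zero mean, the squared-exponential covariance, the grid of points, and the diagonal jitter are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# k evaluation points for a 1-D GP (n = 1).
x = np.linspace(0.0, 1.0, 50)
k = len(x)

# Assumed prior: zero mean and a squared-exponential covariance
# with length scale delta (an alpha = 2 power-exponential form).
mu = np.zeros(k)
delta = 0.2
C = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / delta ** 2)

# Small diagonal jitter keeps the Cholesky factorization stable.
S = np.linalg.cholesky(C + 1e-10 * np.eye(k))

# z = mu + S u, with u a vector of k standard normal samples,
# is a draw from the GP restricted to the k points.
u = rng.standard_normal(k)
z = mu + S @ u
```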

We can use a GP as a method for interpolating the output of a computer model. For the purposes of this section, we will not distinguish between the parameter sets \(u\) and \(x\). Let us denote the model output at a point \(x\) in the combined parameter space as \[Y_m (x) = f(x) \ , \ \mathbb{R}^n \to \mathbb{R} \ ,\] where for now we have assumed that we can make observations of the computer model output without any noise or uncertainty and that the model output is univariate and real. We will use a Bayesian approach to develop a statistical model of the output of the code, an emulator. This is done by taking a Gaussian process, with a given covariance function \({\cal C}\) and mean \(\mu\), as the prior distribution for the simulator output and conditioning it on a set of observations of the simulator.

Let us denote the design, the set of \(d\) points in the parameter space where the model has been evaluated, as \[{\cal D} = \{x_1, x_2, \dots, x_d\} \ , x_i \in \mathbb{R}^n \ .\] The vector of \(d\) outputs evaluated at these points is \[{\cal Y} = (Y_m (x_1), Y_m(x_2),\dots,Y_m(x_d)) \ .\] Our GP prior amounts to \(Y_m|{\cal C},\mu \sim {\rm GP}(\mu,{\cal C})\); we can update this prior with our set of observations \(({\cal D},{\cal Y})\) to obtain a posterior distribution for the simulator output \(Y_*\) at some set of \(k\) yet-untried points \(X_* = \{x_{*,1},\dots,x_{*,k}\}\). Our observations of the model output represent a finite marginalization of the process, so the joint distribution of \({\cal Y}\) and \(Y_*\) is MVN, and conditioning on \({\cal Y}\) yields the posterior for \(Y_*\).
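Because the joint distribution of the observed outputs \({\cal Y}\) and the untried outputs \(Y_*\) is MVN, the posterior follows from the standard MVN conditioning rule. A minimal sketch, assuming a zero prior mean, a squared-exponential covariance, and a toy \(\sin\) simulator (all illustrative, not from the source):

```python
import numpy as np

def se_cov(a, b, delta=0.3):
    # Squared-exponential covariance between 1-D point sets a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / delta ** 2)

# Hypothetical noise-free simulator observed at the design D.
D = np.array([0.0, 0.4, 0.9, 1.4, 2.0])
Y = np.sin(D)

# Untried points X_* at which we want the posterior of Y_*.
Xs = np.linspace(0.0, 2.0, 9)

K = se_cov(D, D) + 1e-10 * np.eye(len(D))  # jitter for stability
Ks = se_cov(D, Xs)
Kss = se_cov(Xs, Xs)

# MVN conditioning with a zero prior mean:
#   E[Y_* | Y]   = Ks^T K^{-1} Y
#   cov[Y_* | Y] = Kss - Ks^T K^{-1} Ks
post_mean = Ks.T @ np.linalg.solve(K, Y)
post_cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
```

At design points the posterior mean interpolates the observations and the posterior variance collapses to (numerically) zero, as expected for a noise-free emulator.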

These methods are predicated upon knowing the right prior mean and covariance functions. By picking certain parameterized functional forms for the prior covariance and mean, this problem can be split into two separate issues:

Model selection: which family of covariance functions, or which set of linear-model (regression) basis functions, is most suitable for describing the data set?

Estimation: given a family of covariance functions and a mean model parameterized by some set of values \(\theta\), which particular values \(\theta_0\) result in a posterior distribution which best reproduces the true model output \(Y_m(x)\)?

Some general advice for model selection follows naturally from linear modeling:

Plot the data in as many ways as you can. Although typically too complex for publication, scatter-plot matrices can make a world of difference in terms of understanding the co-variation of the data.

Try to motivate modeling choices by an understanding of the underlying processes leading to the observations.

Favor moderately good, simple models over highly specialized ones.

The final point is aimed at avoiding over-fitting. An over-fitted model leaves very little room for further variation in the sample.

It will sometimes prove useful to have a full description of the probability distribution of the simulator given an emulator with some specified prior mean and covariance function, along with priors on their parameters. If we want to use the emulator as part of a larger statistical analysis of the behavior of the simulator, we will need the posterior distribution of the model conditioned on our observations, choice of GP prior mean, covariance, and the hyper-parameters which determine them.

Taking our design \({\cal D} = \{x_1, \dots, x_d\}\) as a set of \(d\) points in an \(n\)-dimensional subset of \(\mathbb{R}^n\), denoting our simulator as \(Y_m (x)\), and the training set as \({\cal Y} = \{Y_m (x_1), \dots, Y_m (x_d)\}\), our prior on the model \(Y_m(\cdot)\) is then specified by the prior mean and covariance functions.

The choice of GP covariance function is not entirely free: we are required to select from positive semi-definite functions. It is useful to note that both the sum and product of pairs of positive-definite functions are also positive-definite. Complicated covariance structures with multiple characteristic length scales per dimension can be constructed.
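The closure under sums and products can be checked numerically on finite matrices; the two squared-exponential kernels with different length scales below are illustrative choices:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 15)

def se(delta):
    # Squared-exponential covariance matrix on the points x.
    return np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / delta ** 2)

K1, K2 = se(0.1), se(0.5)  # two length scales in one dimension

# Both the sum and the elementwise (Schur) product of positive-definite
# matrices are positive semi-definite: all eigenvalues are >= 0,
# up to floating-point round-off.
for K in (K1 + K2, K1 * K2):
    assert np.all(np.linalg.eigvalsh(K) > -1e-8)
```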

A covariance function can be presented as an infinite sum of eigenvalues \(\lambda\) and eigenfunctions \(\phi\) \[{\cal C}(x_i, x_j) = \displaystyle\sum_{\ell = 1}^\infty \lambda_\ell \phi_\ell (x_i) \phi_\ell (x_j) \ ,\] where the eigenfunctions are orthogonal \[\int \phi_i (x) \phi_j (x) \ {\rm d} x = \delta_{ij} \ .\] From the positive-definite requirement and symmetry, it is clear that this spectral decomposition will always exist and be real for any finite covariance matrix. In \(d\) dimensions, the spectral density for an isotropic covariance function \({\cal C}(\cdot)\) is \[g(r) = \int_0^\infty r \left(\frac{\rho}{2\pi r}\right)^{d/2} J_{d/2 - 1} (r \rho) {\cal C} (\rho) \ {\rm d} \rho \ .\]
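For a finite covariance matrix the infinite sum truncates to an ordinary symmetric eigendecomposition, which can be verified directly (the kernel and grid below are illustrative choices):

```python
import numpy as np

# Finite covariance matrix from an assumed squared-exponential kernel.
x = np.linspace(0.0, 1.0, 20)
C = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3 ** 2)

# Symmetry guarantees a real spectral decomposition ...
evals, evecs = np.linalg.eigh(C)

# ... and positive semi-definiteness keeps every eigenvalue >= 0
# (up to floating-point round-off).
assert np.all(evals > -1e-8)

# Reassembling C from its eigenpairs is the finite analogue of the
# spectral sum: C_ij = sum_l lambda_l phi_l(x_i) phi_l(x_j).
assert np.allclose((evecs * evals) @ evecs.T, C)
```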

The power-exponential form is the most commonly used covariance form; this relatively simple form is flexible enough to handle most practical applications \[{\cal C}({\vec x},{\vec y},\theta,{\vec \delta},\alpha) = \theta \exp \left(-\frac{1}{2} \displaystyle\sum_{i=1}^{L} \frac{|x_i - y_i|^\alpha}{\delta_i^\alpha}\right) \ , \ 1 \le \alpha \le 2 \ .\] The overall variance for the process is set by \(\theta\), and the scalars \(\delta_i\) set the characteristic correlation length scale in each of the dimensions spanned by the parameter space. Finally, the power \(\alpha\) sets the smoothness of draws from the process; for rather technical reasons relating to the spectral properties of the resulting Gaussian process, a value just less than 2 is preferred.
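A direct transcription of this covariance form (the function name and test values are illustrative):

```python
import numpy as np

def power_exp_cov(x, y, theta, delta, alpha):
    """Power-exponential covariance between points x, y in R^L.

    theta : overall process variance
    delta : per-dimension correlation lengths (length-L array)
    alpha : smoothness power, 1 <= alpha <= 2
    """
    x, y, delta = map(np.asarray, (x, y, delta))
    return theta * np.exp(-0.5 * np.sum(np.abs(x - y) ** alpha
                                        / delta ** alpha))

# At zero separation the covariance equals the process variance theta,
# and it decays monotonically as the points move apart.
assert np.isclose(
    power_exp_cov([0.2, 0.5], [0.2, 0.5], 1.5, [0.4, 0.4], 1.9), 1.5)
```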

Using a Gaussian process prior can be understood as an expression of the belief that the latent function has the form of a realization, i.e., a sample path of a Gaussian process. It is therefore of interest to understand the geometrical properties of Gaussian processes which can be linked to characteristics of the covariance function. This section describes some of these links.

Suppose we have some simulator which produces output \(Y_m(x)\), where \(x\in\mathbb{R}^n\) is a point in \(n\)-dimensional parameter space. We can construct a GP emulator \(\eta\) for \(Y_m(x)\) given some set of \(d\) observations \({\cal Y} = \{Y_m(x_1), \dots, Y_m(x_d)\}\) of the model output evaluated according to some design \({\cal D} = \{x_1,\dots,x_d\}\). We take the prior mean to be a linear model with some set of \(q\) regression functions \(h(x) = \{1,x,\dots\}\) and choose a suitable covariance function parameterized by length scales \(\Theta\). We can construct maximum-likelihood estimates \(\Theta^0\) for the correlation lengths and use these to obtain MLE values for the fit coefficients \({\hat\beta}\) and the overall variance \(\hat{\sigma}^2\).
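This fitting procedure can be sketched as follows. The toy simulator, the crude grid search over \(\Theta\), and all names are illustrative assumptions: \(\hat\beta\) comes from generalized least squares under the correlation structure, and \(\hat\sigma^2\) from the weighted residuals:

```python
import numpy as np

def correlation(D, delta):
    # Squared-exponential correlation matrix for the 1-D design D.
    return np.exp(-0.5 * (D[:, None] - D[None, :]) ** 2 / delta ** 2)

def profile_fit(D, Y, delta):
    """For a fixed length scale delta, the generalized-least-squares
    estimates of the linear-mean coefficients beta and the overall
    variance sigma^2, plus the profile log-likelihood of delta
    (up to an additive constant). A sketch, not from the source."""
    H = np.column_stack([np.ones_like(D), D])        # h(x) = {1, x}
    A = correlation(D, delta) + 1e-8 * np.eye(len(D))
    Ainv_H = np.linalg.solve(A, H)
    beta = np.linalg.solve(H.T @ Ainv_H, Ainv_H.T @ Y)
    r = Y - H @ beta
    sigma2 = r @ np.linalg.solve(A, r) / len(D)
    ll = -0.5 * (len(D) * np.log(sigma2) + np.linalg.slogdet(A)[1])
    return beta, sigma2, ll

# Toy simulator and a crude grid search for the ML length scale.
D = np.linspace(0.0, 2.0, 12)
Y = np.sin(D)
best = max(((delta,) + profile_fit(D, Y, delta)
            for delta in (0.1, 0.3, 0.5, 1.0)),
           key=lambda t: t[3])
delta0, beta_hat, sigma2_hat, _ = best
```

In practice the grid search would be replaced by a numerical optimizer over the profile log-likelihood.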