Interpretation and Complexity Reduction for Gaussian Processes Regression


Gaussian Processes Regression


In a multidimensional regression problem we assume that \(f: \mathcal{X} \rightarrow \mathbb{R}\), \(\mathcal{X}\subset\mathbb{R}^m\), is an unknown dependency function. We are given a noisy learning set \(D = \left\{\left(\mathbf{x}_i, y_i\right)\right\}\), where \(y_i = f(\mathbf{x}_i) + \varepsilon_i\), \(\mathbf{x}_i\in\mathcal{X}\), \(\varepsilon_i\sim\mathcal{N}(0,\sigma^2)\) for \(i=1,\dots,N\), sampled independently and identically distributed (i.i.d.) from some unknown distribution. The goal is to predict the response \(\hat y^*\) at unseen test points \(\mathbf{x}^*\) with small mean-squared error under the data distribution, i.e. to find a function \(\hat{f}\) from a specific class \(\mathcal{C}\) that minimizes the approximation error on the test set \(D_{test} =\left\{\left(\mathbf{x}_j, y_j = f(\mathbf{x}_j)\right)\middle| j = \overline{1, N_*}\right\}\): \[\label{eq:approx_error} \varepsilon\left(\hat{f} \middle| D_{test}\right) = \sqrt{\frac{1}{N_*} \sum\limits_{j = 1}^{N_*} \bigl(y_j - \hat{f}(\mathbf{x}_j)\bigr)^2}.\]
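As an illustration, the approximation error (\ref{eq:approx_error}) can be computed as in the following minimal sketch. The function name `approximation_error` and the use of NumPy are our assumptions for illustration and are not part of the original formulation.

```python
import numpy as np


def approximation_error(y_true, y_pred):
    """Root-mean-squared error of predictions over a test set.

    y_true: responses y_j on the test set, shape (N_*,)
    y_pred: predictions f_hat(x_j) at the same points, shape (N_*,)
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square root of the mean squared residual, as in the formula above
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

The error is zero only when the predictions coincide with the test responses at every test point.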


Gaussian Processes

\label{sec:GaussinaProcesses} In this paper we consider a specific class of regression functions \(\mathcal{GP}\): Gaussian Processes. Any process \(P\in\mathcal{GP}\) is uniquely defined by its mean function \(\mu(\mathbf{x}) = \mathrm{E}\left[f(\mathbf{x})\right]\) and its covariance function \(\mathrm{Cov}\left(f(\mathbf{x}), f(\mathbf{x}^\prime)\right) = k\left(\mathbf{x}, \mathbf{x}^\prime\right) = \mathrm{E}\left[\left(f\left(\mathbf{x}\right) - \mu\left(\mathbf{x}\right)\right) \left(f\left(\mathbf{x}^\prime\right) - \mu\left(\mathbf{x}^\prime\right)\right)\right]\).

If the mean function is set to zero, i.e. \(\mu(\mathbf{x}) = \mathrm{E}\left[f\left(\mathbf{x}\right)\right] = 0\), and the covariance function is assumed to be known, the posterior mean of the Gaussian Process at the points of the test set \(X_*\) has the form \cite{Rasmussen} \(\hat{f}(X_*) = K_* K^{-1} Y\), where \(Y = (y_1, \dots, y_N)^T\) is the vector of training responses, \(K_* = K(X_*, X) = \left[k(\mathbf{x}_i, \mathbf{x}_j), i = \overline{1, N_*}, j = \overline{1,N}\right]\) and \(K = K(X, X) = \left[k(\mathbf{x}_i, \mathbf{x}_j), i, j = \overline{1, N}\right]\).

It is generally assumed that the data are observed with random noise: \( y(\mathbf{x}) = f(\mathbf{x}) + \varepsilon(\mathbf{x})\), where \(\varepsilon(\mathbf{x})\sim\mathcal{N}(0, \tilde{\sigma}^2)\). In that case the observations \(y(\mathbf{x})\) are generated by a Gaussian Process with zero mean and covariance function \(\mathrm{Cov}\left(y(\mathbf{x}), y(\mathbf{x}^\prime)\right) = k(\mathbf{x}, \mathbf{x}^\prime) + \tilde{\sigma}^2\delta(\mathbf{x}- \mathbf{x}^\prime)\), where \(\delta(\mathbf{x})\) is the Kronecker-type delta function, equal to 1 if \(\mathbf{x} = \mathbf{0}\) and 0 otherwise.

Thus, the posterior mean function of the Gaussian Process \(f(\mathbf{x})\) at the points of the test set \(X_*\) takes the form: \[\hat{f}(X_*) = K_* \left(K + \tilde{\sigma}^2 I \right)^{-1} Y, \label{eq:meannoise}\] where \(I\) is the identity matrix of size \((N \times N)\).
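The posterior mean (\ref{eq:meannoise}) can be evaluated numerically as in the following sketch. The squared-exponential kernel `sq_exp_kernel`, its `length_scale` parameter, and the function name `posterior_mean` are hypothetical choices made only for illustration, since the text does not fix a particular covariance function. The linear system is solved through a Cholesky factorization rather than an explicit matrix inverse, which is a common choice for numerical stability and not a requirement of the formula.

```python
import numpy as np


def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix (an assumed example of k).

    A has shape (n_a, m), B has shape (n_b, m); the result is (n_a, n_b).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    # Clip tiny negative values caused by floating-point round-off
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale ** 2)


def posterior_mean(X, Y, X_star, noise_var, kernel=sq_exp_kernel):
    """Posterior mean K_* (K + noise_var * I)^{-1} Y at the test points.

    X: training inputs, shape (N, m); Y: training responses, shape (N,)
    X_star: test inputs, shape (N_*, m); noise_var: noise variance (> 0)
    """
    K = kernel(X, X)
    K_star = kernel(X_star, X)
    # Factorize K + noise_var * I = L L^T and solve for alpha
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    return K_star @ alpha
```

For example, with training inputs `X = np.linspace(0.0, 1.0, 6)[:, None]` and responses `Y = np.sin(2.0 * np.pi * X[:, 0])`, the call `posterior_mean(X, Y, X_star, noise_var=0.1)` returns one predicted value per row of `X_star`.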

Note that the noise variance \(\tilde{\sigma}^2\) in (\ref{eq:meannoise}) effectively acts as a regularizer and improves the generalization ability of the resulting regression. The posterior covariance of the Gaussian Process at the points of the test set takes the form: \[\mathrm{V} \left[X_*\right] = K(X_*, X_*) + \tilde{\sigma}^2 I_* - K_* \left(K + \tilde{\sigma}^2 I \right)^{-1} K_*^T, \label{eq:covariancenoise}\] where \(K(X_*, X_*) = \left[k(\mathbf{x}_i, \mathbf{x}_j) \middle| i, j = 1, \dots, N_*\right]\) and \(I_*\) is the identity matrix of size \((N_* \times N_*)\).
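The covariance (\ref{eq:covariancenoise}) can be computed along the same lines. The sketch below again assumes a squared-exponential kernel (`sq_exp_kernel` with a hypothetical `length_scale` parameter) purely for illustration; the function name `posterior_covariance` is likewise our own.

```python
import numpy as np


def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix (an assumed example of k)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_dist = (np.sum(A ** 2, axis=1)[:, None]
               + np.sum(B ** 2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale ** 2)


def posterior_covariance(X, X_star, noise_var, kernel=sq_exp_kernel):
    """K(X_*, X_*) + noise_var * I_* - K_* (K + noise_var * I)^{-1} K_*^T.

    X: training inputs, shape (N, m); X_star: test inputs, shape (N_*, m)
    """
    K = kernel(X, X)
    K_star = kernel(X_star, X)
    K_ss = kernel(X_star, X_star)
    # With K + noise_var * I = L L^T, the subtracted term equals W^T W,
    # where W = L^{-1} K_*^T has shape (N, N_*)
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))
    W = np.linalg.solve(L, K_star.T)
    return K_ss + noise_var * np.eye(len(X_star)) - W.T @ W
```

The diagonal of the returned matrix gives the predictive variance at each test point; under this formula it is bounded below by the noise variance, up to round-off.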