# Interpretation and Complexity Reduction for Gaussian Processes Regression


Abstract


# Introduction


# Gaussian Processes Regression


In a multidimensional regression problem we assume that $$f: \mathcal{X} \rightarrow \mathbb{R}, \mathcal{X}\subset\mathbb{R}^m$$ is an unknown dependency function. We are given a noisy learning set $$D = \left\{\left(\mathbf{x}_i, y_i\right)\right\}$$, where $$y_i = f(\mathbf{x}_i) + \varepsilon_i, \mathbf{x}_i\in\mathcal{X}, \varepsilon_i\sim\mathcal{N}(0,\sigma^2)$$ for $$i=1,\dots,N$$, sampled independently and identically distributed (i.i.d.) from some unknown distribution. The goal is to predict the response $$\hat y^*$$ at unseen test points $$\mathbf{x}^*$$ with small mean-squared error under the data distribution, i.e. to find a function $$\hat{f}$$ from a specific class $$\mathcal{C}$$ that minimizes the approximation error on the test set $$D_{test} =\left\{\left(\mathbf{x}_j, y_j = f(\mathbf{x}_j)\right)\middle| j = \overline{1, N_*}\right\}$$: $\label{eq:approx_error} \varepsilon\left(\hat{f} \middle| D_{test}\right) = \sqrt{\frac{1}{N_*} \sum\limits_{j = 1}^{N_*} \bigl(y_j - \hat{f}(\mathbf{x}_j)\bigr)^2}.$
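The approximation error above is the root-mean-squared error over the test set. A minimal sketch of its computation follows; the function name `approximation_error` and the sample values are illustrative assumptions, not part of the original text.

```python
import numpy as np

def approximation_error(y_true, y_pred):
    """Root-mean-squared error of predictions over a test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical test responses and predictions (illustrative values only).
y_test = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 2.0]

# Residuals are (0, 0, 0, 2): mean squared residual is 1, so the error is 1.0.
print(approximation_error(y_test, y_hat))
```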


## Gaussian Processes

In this paper we consider a specific class of regression functions $$\mathcal{GP}$$: Gaussian Processes. Any process $$P\in\mathcal{GP}$$ is uniquely defined by its mean function $$\mu(\mathbf{x}) = \mathrm{E}\left[f(\mathbf{x})\right]$$ and its covariance function $$\mathrm{Cov}\left(f(\mathbf{x}), f(\mathbf{x}^\prime)\right) = k\left(\mathbf{x}, \mathbf{x}^\prime\right) = \mathrm{E}\left[\left(f\left(\mathbf{x}\right) - \mu\left(\mathbf{x}\right)\right) \left(f\left(\mathbf{x}^\prime\right) - \mu\left(\mathbf{x}^\prime\right)\right)\right]$$.

If the mean function is set to zero, i.e. $$\mu(\mathbf{x}) = \mathrm{E}\left[f\left(\mathbf{x}\right)\right] = 0$$, and the covariance function is assumed to be known, the posterior mean of the Gaussian Process at the points of the test set $$X_*$$ has the form (Rasmussen) $$\hat{f}(X_*) = K_* K^{-1} Y$$, where $$K_* = K(X_*, X) = \left[k(\mathbf{x}_i, \mathbf{x}_j), i = \overline{1, N_*}, j = \overline{1,N}\right]$$ is the cross-covariance matrix between test and learning points and $$K = K(X, X) = \left[k(\mathbf{x}_i, \mathbf{x}_j), i, j = \overline{1, N}\right]$$ is the covariance matrix of the learning points.
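The noise-free posterior mean $$\hat{f}(X_*) = K_* K^{-1} Y$$ can be sketched in a few lines of NumPy. The squared-exponential kernel, its `length_scale` parameter and the toy data below are assumptions made for illustration only; the text does not fix a particular covariance function. The sketch solves a linear system rather than forming $$K^{-1}$$ explicitly.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Matrix [k(a_i, b_j)] for a squared-exponential kernel (assumed choice of k).

    A has shape (n, m) and B has shape (p, m); the result has shape (n, p).
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def posterior_mean_noise_free(X, Y, X_star, kernel=sq_exp_kernel):
    """Posterior mean K_* K^{-1} Y of a zero-mean GP without observation noise."""
    K = kernel(X, X)            # (N, N) covariance of the learning points
    K_star = kernel(X_star, X)  # (N_*, N) cross-covariance test/learning
    return K_star @ np.linalg.solve(K, Y)

# Toy one-dimensional learning set (illustrative values only).
X = np.array([[0.0], [1.0], [2.0]])
Y = np.sin(X[:, 0])
X_star = np.array([[0.5], [1.5]])

print(posterior_mean_noise_free(X, Y, X_star))
```

Without noise the posterior mean interpolates the learning set: evaluating it at `X` itself returns `Y` up to numerical error.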

It is generally assumed that the data are observed with random noise: $$y(\mathbf{x}) = f(\mathbf{x}) + \varepsilon(\mathbf{x})$$, where $$\varepsilon(\mathbf{x})\sim\mathcal{N}(0, \tilde{\sigma}^2)$$. In that case the observations $$y(\mathbf{x})$$ are generated by a Gaussian Process with zero mean and covariance function $$\mathrm{Cov}\left(y(\mathbf{x}), y(\mathbf{x}^\prime)\right) = k(\mathbf{x}, \mathbf{x}^\prime) + \tilde{\sigma}^2\delta(\mathbf{x}- \mathbf{x}^\prime)$$, where $$\delta(\mathbf{x})$$ is the Kronecker-type delta function, equal to $$1$$ for $$\mathbf{x} = \mathbf{0}$$ and $$0$$ otherwise.

Thus, the posterior mean function of the Gaussian Process $$f(\mathbf{x})$$ at the points of the test set $$X_*$$ takes the form: $\hat{f}(X_*) = K_* \left(K + \tilde{\sigma}^2 I \right)^{-1} Y, \label{eq:meannoise}$ where $$I$$ is the identity matrix of size $$(N \times N)$$.

Note that the noise variance $$\tilde{\sigma}^2$$ in (\ref{eq:meannoise}) in fact acts as a regularization term and improves the generalization ability of the resulting regression. The posterior covariance of the Gaussian Process at the points of the test set takes the form: $\mathrm{V} \left[X_*\right] = K(X_*, X_*) + \tilde{\sigma}^2 I_* - K_* \left(K + \tilde{\sigma}^2 I \right)^{-1} K_*^T, \label{eq:covariancenoise}$ where $$K(X_*, X_*) = \left[k(\mathbf{x}_i, \mathbf{x}_j) \middle| i, j = 1, \dots, N_*\right]$$ and $$I_*$$ is the identity matrix of size $$(N_* \times N_*)$$.
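The posterior mean and covariance in the noisy case can be computed together from one Cholesky factorization of $$K + \tilde{\sigma}^2 I$$. The following is a sketch under stated assumptions: the squared-exponential kernel, the function name `gp_posterior` and the value of `noise_var` (standing in for $$\tilde{\sigma}^2$$) are illustrative choices, not taken from the text.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Matrix [k(a_i, b_j)] for a squared-exponential kernel (assumed choice of k)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(X, Y, X_star, noise_var, kernel=sq_exp_kernel):
    """Posterior mean and covariance of noisy observations at the test points X_star."""
    K_noisy = kernel(X, X) + noise_var * np.eye(len(X))  # K + sigma^2 I
    K_star = kernel(X_star, X)                           # (N_*, N)
    L = np.linalg.cholesky(K_noisy)                      # K_noisy = L L^T

    # alpha = (K + sigma^2 I)^{-1} Y via two triangular systems.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    mean = K_star @ alpha

    # V = L^{-1} K_*^T, so V^T V = K_* (K + sigma^2 I)^{-1} K_*^T.
    V = np.linalg.solve(L, K_star.T)
    cov = kernel(X_star, X_star) + noise_var * np.eye(len(X_star)) - V.T @ V
    return mean, cov

# Toy one-dimensional learning set (illustrative values only).
X = np.array([[0.0], [1.0], [2.0]])
Y = np.sin(X[:, 0])
X_star = np.array([[0.5], [1.5]])

mean, cov = gp_posterior(X, Y, X_star, noise_var=0.1)
print(mean)
print(np.diag(cov))  # predictive variances at the test points
```

The regularizing effect is easiest to see with a single learning point: for $$N = 1$$, $$k(\mathbf{x}, \mathbf{x}) = 1$$ and $$\tilde{\sigma}^2 = 1$$, the posterior mean at that same point is $$y_1 / 2$$, i.e. the prediction is shrunk towards the zero prior mean instead of interpolating the observation.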