\section{Classification of covariance matrices}

\label{sec:classofcov}

\subsection{Fundamentals of classification}

\label{subsec:fund_class}

Given labelled samples \(x_{i}\) drawn from two classes or populations (positive and negative), a simple classification algorithm consists in assigning a previously unseen sample to the class with the closer mean. This requires computing the mean of each class and measuring distances to these means. Assuming that the samples are embedded in a dot product space (i.e. with Euclidean geometry), the means can be computed as:

\begin{equation} \label{eq:mean_eucl1} c_{+}=\frac{1}{m_{+}}\sum_{\left\{i|y_{i}=+1\right\}}x_{i}, \end{equation}
\begin{equation} \label{eq:mean_eucl2} c_{-}=\frac{1}{m_{-}}\sum_{\left\{i|y_{i}=-1\right\}}x_{i}, \end{equation}

where \(y_{i}\in\left\{+1,-1\right\}\) is the label of the training sample \(x_{i}\), and \(m_{+}\) and \(m_{-}\) are the numbers of positive and negative samples, respectively. An unseen sample \(x\) is assigned to the class whose mean is the closest. This simple geometric classification framework is the founding principle of more complex algorithms such as support vector machines.

It can be formulated in terms of the dot product \(\left\langle\cdot,\cdot\right\rangle\). If \(c:=(c_{+}+c_{-})/2\) is the point lying halfway between \(c_{+}\) and \(c_{-}\), and \(w:=c_{+}-c_{-}\) the vector connecting the two class means, the class label \(y\) of the unseen sample \(x\) is determined by checking whether the vector \(x-c\) connecting \(c\) to \(x\) makes an angle \(\alpha<\pi/2\) with \(w\) [ref: fig]. This is expressed as:

\begin{equation} \label{eq:classif1} \begin{split}\displaystyle y&\displaystyle=\mathrm{sgn}\left\langle(x-c),w\right\rangle\\ &\displaystyle=\mathrm{sgn}\left\langle(x-(c_{+}+c_{-})/2),(c_{+}-c_{-})\right\rangle\\ &\displaystyle=\mathrm{sgn}(\left\langle x,c_{+}\right\rangle-\left\langle x,c_{-}\right\rangle+b)\end{split} \end{equation}

where \(\mathrm{sgn}\) is the sign function. The offset \(b\) vanishes if the class means are equidistant from the origin \cite{scholkopf_learning_2001}. Inserting (\ref{eq:mean_eucl1}) and (\ref{eq:mean_eucl2}) in (\ref{eq:classif1}) yields:

\begin{equation} \label{eq:classif2} y=\mathrm{sgn}\left(\frac{1}{m_{+}}\sum_{\left\{i|y_{i}=+1\right\}}\left\langle x,x_{i}\right\rangle-\frac{1}{m_{-}}\sum_{\left\{i|y_{i}=-1\right\}}\left\langle x,x_{i}\right\rangle+b\right). \end{equation}

Classifier (\ref{eq:classif2}) can be generalized as:

\begin{equation} \label{eq:classif_gen} y=\mathrm{sgn}\left(\sum_{i=1}^{m}y_{i}\alpha_{i}d(x,x_{i})+b\right), \end{equation}

where \(\alpha_{i}\) is the weight of the training sample \(x_{i}\) and \(d(\cdot,\cdot)\) is a distance, a divergence, or a kernel. In the case of two classes with \(m\) samples (\(m=m_{+}+m_{-}\)) in the dot product space, where all samples have the same weight, \(y_{i}\in\left\{+1,-1\right\}\), \(\alpha_{i}=1/m\) and \(d(\cdot,\cdot)=\left\langle\cdot,\cdot\right\rangle\).

Expression (\ref{eq:classif_gen}) corresponds to the decision function used in hyperplane classifiers \cite{scholkopf_learning_2001}. This shows that even more complex classifiers rely on the computation of class means (or centers) and of their distances to individual samples. In this article we therefore focus on the simple classification approach of assigning a previously unseen sample to the class with the closest mean.

In machine learning algorithms, samples are represented by features obtained through a feature extraction and selection process. In this work, samples are represented by their covariance matrices. The class means and the distances to these means are therefore means of covariance matrices and distances between covariance matrices, as sketched below.
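For concreteness, the following minimal sketch illustrates this minimum-distance-to-mean scheme in Python; the helper names \texttt{fit\_class\_means} and \texttt{predict} are ours and not taken from any specific library, and any of the means and distances discussed below can be plugged in.

\begin{verbatim}
# Minimal sketch of a minimum-distance-to-mean classifier for covariance
# matrices. `mean_fn` and `dist_fn` stand for any of the means and
# distances discussed in this section.
import numpy as np

def fit_class_means(covs, labels, mean_fn):
    """Compute one mean covariance matrix per class."""
    return {c: mean_fn([S for S, y in zip(covs, labels) if y == c])
            for c in set(labels)}

def predict(cov, class_means, dist_fn):
    """Assign a sample to the class whose mean is closest."""
    return min(class_means, key=lambda c: dist_fn(cov, class_means[c]))

# Euclidean mean and distance used as simple placeholders:
euclid_mean = lambda mats: np.mean(mats, axis=0)
euclid_dist = lambda A, B: np.linalg.norm(A - B, 'fro')
\end{verbatim}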

\subsection{Means of covariance matrices}

\label{subsec:mean}

Consider a multivariate variable \(X\in\mathbb{R}^{C\times N}\), where \(C\) is the number of variables and \(N\) the number of samples, with \(N>C\). The covariance matrix of the centered variable \(X\) can be estimated as:

\begin{equation} \Sigma=\frac{1}{N}XX^{\intercal}\\ \end{equation}

and is symmetric positive definite (SPD): \(\Sigma\in\mathcal{M}_{C}\), a manifold of \(C\times C\) symmetric positive definite matrices,

\begin{equation} \label{eq:rm} \mathcal{M}_{C}=\left\{\Sigma\in\mathbb{R}^{C\times C}:\Sigma=\Sigma^{\intercal}\text{ and }x^{\intercal}\Sigma x>0,\forall x\in\mathbb{R}^{C}\backslash 0\right\}\ .\nonumber \end{equation}

The properties of SPD matrices constrain them to a convex cone:

\begin{enumerate}[label=(\roman*)]
  \item Symmetry: \(\Sigma=\Sigma^{\intercal}\),
  \item Positive definiteness: \(x^{\intercal}\Sigma x>0,\ \forall x\in\mathbb{R}^{C}\backslash 0\),
  \item Strict positivity of the diagonal elements: \(\Sigma(i,i)>0,\ \forall i\in\left\{1,\dots,C\right\}\), i.e. positive variances,
  \item Cauchy-Schwarz inequalities: \(|\Sigma(i,j)|\leq(\Sigma(i,i)\Sigma(j,j))^{1/2},\ \forall i,j\in\left\{1,\dots,C\right\}\).
\end{enumerate}
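As a brief numerical illustration (a sketch on synthetic data, not part of the derivation), the covariance estimate above can be computed and the four properties checked as follows:

\begin{verbatim}
# Estimate a covariance matrix from a centered multivariate variable
# X (C variables x N samples) and check the SPD properties listed above.
import numpy as np

rng = np.random.default_rng(0)
C, N = 4, 100                       # N > C so the estimate is full rank
X = rng.standard_normal((C, N))
X -= X.mean(axis=1, keepdims=True)  # center each variable
Sigma = X @ X.T / N                 # sample covariance

assert np.allclose(Sigma, Sigma.T)            # (i)   symmetry
assert np.all(np.linalg.eigvalsh(Sigma) > 0)  # (ii)  positive definiteness
assert np.all(np.diag(Sigma) > 0)             # (iii) positive variances
i, j = 0, 1                                   # (iv)  Cauchy-Schwarz (one pair)
assert abs(Sigma[i, j]) <= np.sqrt(Sigma[i, i] * Sigma[j, j])
\end{verbatim}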

The mean of a set of SPD matrices can be computed as a center of mass, by analogy with Euclidean geometry: given a set of covariance matrices \(\{\Sigma_{i}\}_{i=1,\dots,I}\), the center of mass \(\bar{\Sigma}\) of the set is the covariance matrix that minimizes the sum of the squared distances to the matrices \(\Sigma_{i}\):

\begin{equation} \label{eq:mean} \bar{\Sigma}=\mu(\Sigma_{1},\dots,\Sigma_{I})=\arg\!\min_{\Sigma\in\mathcal{M}_{C}}\sum_{i=1}^{I}d^{2}(\Sigma_{i},\Sigma)\ , \end{equation}

where \(d(\cdot,\cdot)\) is a measure of distance between two matrices. Practically, \(d(\cdot,\cdot)\) can either be a distance or a divergence.

In the literature, this mean is sometimes designated as the Fréchet mean, Cartan mean, or Karcher mean\footnote{This appellation has recently been criticized by Karcher himself \cite{karcher_riemannian_2014}.} \cite{ando2004geometric,lim_matrix_2012}. Cartan showed that a unique solution to (\ref{eq:mean}) exists if all \(\Sigma_{i}\) lie in a convex ball \citep[Section 16 of][]{cartan_groupes_1929}; this also applies to closed convex cones.

Depending on the divergence or distance used, several means can be defined from (\ref{eq:mean}). Those considered in this study are presented in the following subsections and summarized in Table \ref{tab:dist}.

\subsection{Distances and divergences}

\label{sec:distdiv}

Divergences and distances are measures of dissimilarity between two points in a space. Here a Riemannian space \(\mathcal{M}\) will be considered. A distance function \(d:\mathcal{M}\times\mathcal{M}\rightarrow\mathbb{R}^{+}\) has the following properties for all \(\Sigma_{1},\Sigma_{2},\Sigma_{3}\in\mathcal{M}\):

\begin{enumerate}[label=(\roman*)]
  \item Non-negativity: \(d(\Sigma_{1},\Sigma_{2})\geq 0\),
  \item Identity: \(d(\Sigma_{1},\Sigma_{2})=0\ \ \mathrm{iff}\ \ \Sigma_{1}=\Sigma_{2}\),
  \item Symmetry: \(d(\Sigma_{1},\Sigma_{2})=d(\Sigma_{2},\Sigma_{1})\),
  \item Triangle inequality: \(d(\Sigma_{1},\Sigma_{3})\leq d(\Sigma_{1},\Sigma_{2})+d(\Sigma_{2},\Sigma_{3})\).
\end{enumerate}

Divergences are similar to distances, except that properties (iii) and (iv) need not be satisfied. In the context of covariance matrices, divergences and distances should both induce a Riemannian metric on the manifold of SPD matrices.

\subsubsection{Euclidean distance}

The Euclidean distance between two matrices is given by the Frobenius norm of their difference:

\begin{equation} \label{eq:dist_eucl} d_{\operatorname{E}}(\Sigma_{1},\Sigma_{2})=\lVert\Sigma_{1}-\Sigma_{2}\rVert_{F} \end{equation}

In (\ref{eq:mean}), this yields the arithmetic mean:

\begin{equation} \label{eq:mean_arithmetic} \bar{\Sigma}_{\operatorname{E}}=\frac{1}{I}\sum_{i=1}^{I}\Sigma_{i} \end{equation}

The arithmetic mean belongs to a family of power means (\(\Sigma_{t}\) with \(t=1\)) \cite{lim_matrix_2012}:

\begin{equation} \label{eq:mean_power} \Sigma_{t}=\left(\frac{1}{I}\sum_{i=1}^{I}\Sigma_{i}^{t}\right)^{\frac{1}{t}},\ t\in[-1,+1]. \end{equation}

The same family also yields the geometric mean (\(\Sigma_{t}\) with \(t\rightarrow 0\)) and the harmonic mean (\(\Sigma_{t}\) with \(t=-1\)).

We consider the arithmetic mean \(\bar{\Sigma}_{\operatorname{E}}\) as a baseline. This averaging is usually not adequate in the space of SPD matrices for two main reasons. Firstly, the Euclidean distance and the Euclidean average do not guarantee invariance under inversion: a matrix and its inverse should be at the same distance from the identity matrix. Secondly, the Euclidean averaging of SPD matrices leads to a swelling effect: the determinant of the arithmetic mean can be larger than the determinant of any of the averaged matrices. Since the determinant of a covariance matrix is a direct measure of the dispersion of the multivariate variable, the swelling effect introduces a large distortion of the data dispersion \cite{arsigny_geometric_2007}. For these reasons, other means that are adapted to the geometry of the convex cone of SPD matrices are used.
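The sketch below (with illustrative values; the helper \texttt{power\_mean} is ours) implements the power mean of (\ref{eq:mean_power}) and shows the swelling effect on a pair of covariance matrices whose arithmetic mean has a larger determinant than either matrix:

\begin{verbatim}
# Power mean of eq. (mean_power) and a numerical illustration of the
# swelling effect of the arithmetic mean (t = 1).
import numpy as np
from scipy.linalg import fractional_matrix_power

def power_mean(covs, t):
    M = np.mean([fractional_matrix_power(S, t) for S in covs], axis=0)
    return fractional_matrix_power(M, 1.0 / t)

covs = [np.array([[1.0,  0.9], [ 0.9, 1.0]]),
        np.array([[1.0, -0.9], [-0.9, 1.0]])]
arith = power_mean(covs, 1.0)        # arithmetic mean
# det(arith) = 1.0, while each individual determinant is 0.19
print(np.linalg.det(arith), [np.linalg.det(S) for S in covs])
\end{verbatim}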

\subsubsection{Affine-invariant Riemannian distance}

The affine-invariant Riemannian distance between two points is defined as the length of the curve connecting these points on the Riemannian manifold. A differentiable manifold is a curved topological space that is locally similar to Euclidean space and globally differentiable. The convex cone of SPD matrices is a manifold that can be endowed with a Riemannian metric; such manifolds are called Riemannian manifolds. Let \(\mathcal{M}\) be a Riemannian manifold, and \(T_{\Sigma}\mathcal{M}\) its tangent space at the point \(\Sigma\). A Riemannian metric \(\mathit{d}\) is a family of inner products, one defined on the tangent space at each point \(\Sigma\) of the manifold. This inner product varies smoothly from point to point on the manifold,

\begin{equation} \mathit{d}_{\Sigma}:T_{\Sigma}\mathcal{M}\times T_{\Sigma}\mathcal{M}\rightarrow\mathbb{R}\nonumber \\ \end{equation}

\(\mathit{d}\) is a function that assigns, to each point \(\Sigma\in\mathcal{M}\), an inner product on the tangent space \(T_{\Sigma}\mathcal{M}\). The Riemannian metric allows us to compute the length of vectors, or the distance between two points, on the tangent space and, through an appropriate mapping \cite{pennec_riemannian_2006}, the length of the corresponding geodesic (i.e. the shortest curve connecting two points) on the manifold \(\mathcal{M}\). The affine-invariant Riemannian distance is the distance between two points of a Riemannian manifold endowed with the invariant Riemannian metric \(g_{\Sigma}\) defined at \(\Sigma\):

\begin{equation} \label{eq:metric-riemann} \begin{split}\displaystyle g_{\Sigma}(\Theta_{1},\Theta_{2})&\displaystyle=\left\langle\Theta_{1},\Theta_{2}\right\rangle_{\Sigma}\\ &\displaystyle=\left\langle\Sigma^{-\frac{1}{2}}\Theta_{1}\Sigma^{-\frac{1}{2}},\Sigma^{-\frac{1}{2}}\Theta_{2}\Sigma^{-\frac{1}{2}}\right\rangle_{\mathbf{I}}\\ &\displaystyle=\operatorname{tr}\left(\Sigma^{-\frac{1}{2}}\Theta_{1}\Sigma^{-1}\Theta_{2}\Sigma^{-\frac{1}{2}}\right),\end{split} \end{equation}

where \(\mathbf{I}\) is the identity matrix and \(\operatorname{tr}\) the trace operator. The inner product of the tangent vectors \(\Theta_{1}\) and \(\Theta_{2}\) at \(\Sigma\) is invariant under the congruence transformation by \(\Sigma^{-\frac{1}{2}}\). The affine-invariant Riemannian distance is defined as:

\begin{equation} \label{eq:dist_air} d_{\operatorname{AIR}}(\Sigma_{1},\Sigma_{2})=\lVert\operatorname{Log}(\Sigma_{1}^{-1}\Sigma_{2})\rVert_{F}=\left(\sum_{c=1}^{C}\log^{2}\lambda_{c}\right)^{1/2}, \end{equation}

where \(\operatorname{Log}\) denotes the matrix logarithm (as opposed to the scalar logarithm \(\log\)), and \(\lambda_{c}\), \(c=1,\dots,C\), are the eigenvalues of \(\Sigma_{1}^{-1}\Sigma_{2}\).

Inserting (\ref{eq:dist_air}) in (\ref{eq:mean}) yields the mean \(\bar{\Sigma}_{\operatorname{AIR}}\) associated with the affine-invariant Riemannian metric. It is the solution to

\begin{equation} \label{mean_air} \sum_{i=1}^{I}\operatorname{Log}(\bar{\Sigma}_{\operatorname{AIR}}^{-1/2}\Sigma_{i}\bar{\Sigma}_{\operatorname{AIR}}^{-1/2})=0 \end{equation}

This equation has no closed-form solution and is solved iteratively through a gradient descent algorithm \cite{fletcher2004principal}.
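A compact sketch of such a procedure is given below: it implements (\ref{eq:dist_air}) and a simple fixed-point gradient iteration for (\ref{mean_air}) using eigendecompositions. This is an illustrative implementation under our own naming, not an optimized solver.

\begin{verbatim}
# Affine-invariant Riemannian distance and mean (simple gradient scheme).
import numpy as np

def sym_fun(S, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T

def air_dist(S1, S2):
    """Eq. (dist_air): Frobenius norm of Log(S1^{-1} S2)."""
    isqrt1 = sym_fun(S1, lambda w: 1.0 / np.sqrt(w))
    lam = np.linalg.eigvalsh(isqrt1 @ S2 @ isqrt1)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def air_mean(covs, iters=50, tol=1e-10):
    """Iterative solution of eq. (mean_air)."""
    M = np.mean(covs, axis=0)              # arithmetic mean as initial guess
    for _ in range(iters):
        sqrt_M = sym_fun(M, np.sqrt)
        isqrt_M = sym_fun(M, lambda w: 1.0 / np.sqrt(w))
        # Riemannian gradient: mean of Log(M^{-1/2} Sigma_i M^{-1/2})
        G = np.mean([sym_fun(isqrt_M @ S @ isqrt_M, np.log) for S in covs],
                    axis=0)
        M = sqrt_M @ sym_fun(G, np.exp) @ sqrt_M
        if np.linalg.norm(G, 'fro') < tol:
            break
    return M
\end{verbatim}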

This distance and its associated mean are invariant under affine transformations. Some of these invariances are particularly important to preserve the geometry of the Riemannian manifold of SPD matrices. Let \(f\) be an affine-invariant Riemannian function defined on \(\mathcal{M}\) (e.g. a distance or a mean):

\begin{enumerate}[label=(\roman*)]
  \item Invariance under congruent transformation:
    \begin{equation} \label{eq:invar_congr} f(\Sigma_{1},\Sigma_{2})=f(\Sigma\Sigma_{1}\Sigma^{\intercal},\Sigma\Sigma_{2}\Sigma^{\intercal}) \end{equation}
  \item Invariance under inversion:
    \begin{equation} \label{eq:invar_invers} f(\Sigma,\mathbf{I})=f(\Sigma^{-1},\mathbf{I}) \end{equation}
    implying
    \begin{equation} \label{eq:invar_invers2} f(\Sigma_{1},\Sigma_{2})=f(\Sigma_{1}^{-1},\Sigma_{2}^{-1}) \end{equation}
\end{enumerate}

Another interesting property of the affine-invariant Riemannian metric is its invariance under left- and right-multiplication by a positive matrix \(\Sigma\):

\begin{equation} \label{eq:invar_mult} f(\Sigma_{1},\Sigma_{2})=f(\Sigma\Sigma_{1},\Sigma\Sigma_{2})=f(\Sigma_{1}\Sigma,\Sigma_{2}\Sigma) \end{equation}
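These invariances are easy to verify numerically. The sketch below checks (\ref{eq:invar_congr}), (\ref{eq:invar_invers2}) and (\ref{eq:invar_mult}) on random SPD matrices with the affine-invariant Riemannian distance; the general eigenvalue-based implementation is chosen here because it also accepts the non-symmetric products \(\Sigma\Sigma_{1}\) (illustrative code and helper names, not from a specific library).

\begin{verbatim}
# Numerical check of the affine invariances with the AIR distance.
import numpy as np

def air_dist(A, B):
    # eigenvalues of A^{-1} B (real and positive for the matrices used here)
    lam = np.real(np.linalg.eigvals(np.linalg.solve(A, B)))
    return np.sqrt(np.sum(np.log(lam) ** 2))

def random_spd(n, rng):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

rng = np.random.default_rng(1)
S1, S2, W = (random_spd(3, rng) for _ in range(3))

d = air_dist(S1, S2)
assert np.isclose(d, air_dist(W @ S1 @ W.T, W @ S2 @ W.T))            # congruence
assert np.isclose(d, air_dist(np.linalg.inv(S1), np.linalg.inv(S2)))  # inversion
assert np.isclose(d, air_dist(W @ S1, W @ S2))                        # left mult.
assert np.isclose(d, air_dist(S1 @ W, S2 @ W))                        # right mult.
\end{verbatim}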

\subsubsection{Log-Euclidean distance}

The Log-Euclidean distance is another distance that takes into consideration the topology of the Riemannian manifold. It was introduced by Arsigny et al. \citep{arsigny_geometric_2007} to alleviate the computational complexity of the affine-invariant Riemannian distance and of its related mean. The mean associated with the Log-Euclidean distance corresponds to an arithmetic mean in the domain of matrix logarithms. The distance between two SPD matrices is expressed as

\begin{equation} \label{eq:dist_LE} d_{\operatorname{LE}}(\Sigma_{1},\Sigma_{2})=\lVert\operatorname{Log}(\Sigma_{1})-\operatorname{Log}(\Sigma_{2})\rVert_{F}, \end{equation}

and its associated mean is defined explicitly:

\begin{equation} \label{eq:mean_LE} \bar{\Sigma}_{\operatorname{LE}}=\operatorname{Exp}\left(\frac{1}{I}\sum_{i=1}^{I}\operatorname{Log}(\Sigma_{i})\right). \end{equation}
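A short sketch of the Log-Euclidean distance and mean follows, again using eigendecompositions for the matrix logarithm and exponential of SPD matrices (illustrative code, not taken from a specific library):

\begin{verbatim}
# Log-Euclidean distance and its closed-form mean.
import numpy as np

def sym_fun(S, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix."""
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T

def le_dist(S1, S2):
    """Eq. (dist_LE): Frobenius norm of Log(S1) - Log(S2)."""
    return np.linalg.norm(sym_fun(S1, np.log) - sym_fun(S2, np.log), 'fro')

def le_mean(covs):
    """Eq. (mean_LE): matrix exponential of the mean of matrix logarithms."""
    return sym_fun(np.mean([sym_fun(S, np.log) for S in covs], axis=0), np.exp)
\end{verbatim}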

Unlike the affine-invariant Riemannian mean, the Log-Euclidean mean has a closed-form expression, which gives it a large computational advantage. Moreover, the obtained mean is, to a large extent, similar to the affine-invariant Riemannian mean:

\begin{enumerate}[label=(\roman*)]
  \item they have the same determinant, which corresponds to the geometric mean of the determinants of the averaged matrices:
    \begin{equation} \det\bar{\Sigma}_{\operatorname{LE}}=\det\bar{\Sigma}_{\operatorname{AIR}}=\prod_{i=1}^{I}(\det\Sigma_{i})^{1/I}=\exp\left(\frac{1}{I}\sum_{i=1}^{I}\log(\det\Sigma_{i})\right)\ ;\nonumber \end{equation}
  \item they are often equal in value; when they are not, \(\operatorname{tr}(\bar{\Sigma}_{\operatorname{LE}})>\operatorname{tr}(\bar{\Sigma}_{\operatorname{AIR}})\);
  \item the Log-Euclidean mean has properties close to affine-invariance (i.e. similarity-invariance instead of congruence-invariance).
\end{enumerate}

\subsubsection{Bregman divergences}

Divergences have been considered for the computation of means in clustering and classification applications involving SPD matrices because they induce a Riemannian metric given by (\ref{eq:metric-riemann}). Consider a strictly convex and differentiable function \(f:\mathbb{R}\rightarrow\mathbb{R}\); then \(f(x)\geq f(y)+f^{\prime}(y)(x-y)\), with equality if and only if \(x=y\), for all \(x,y\in\mathbb{R}\). The Bregman divergence \citep[introduced by Bregman in][]{bregman_relaxation_1967} is the difference between the left- and right-hand sides of this inequality:

\begin{equation} \label{eq:bregman-div} D_{f}(x,y)=f(x)-f(y)-f^{\prime}(y)(x-y). \end{equation}

The function \(f\) is called the seed function. It can be shown that \(D_{f}\) satisfies the non-negativity and identity properties. When the seed function is quadratic, \(D_{f}\) is also symmetric. Additional properties satisfied by \(D_{f}\) are reported in \cite{bregman_relaxation_1967}. Geometrically, the Bregman divergence can be seen as the difference between \(f(x)\) and its first-order approximation at \(y\), i.e. its value on the tangent to \(f\) at \(y\), as illustrated in Fig. \ref{fig:bregman-projection}.
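To make the definition concrete, the following sketch evaluates the scalar Bregman divergence (\ref{eq:bregman-div}) for two common seed functions; the particular seeds and values used here are illustrative.

\begin{verbatim}
# Scalar Bregman divergence D_f(x, y) = f(x) - f(y) - f'(y)(x - y).
import numpy as np

def bregman(x, y, f, f_prime):
    return f(x) - f(y) - f_prime(y) * (x - y)

# Quadratic seed f(t) = t^2: recovers the squared distance (x - y)^2, symmetric.
d_quad = bregman(2.0, 0.5, lambda t: t ** 2, lambda t: 2 * t)       # 2.25

# Seed f(t) = t log t: a Kullback-Leibler-type divergence, not symmetric.
d_kl = bregman(2.0, 0.5, lambda t: t * np.log(t), lambda t: np.log(t) + 1)
\end{verbatim}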