Summary: Complete Dictionary Recovery over the Sphere

Main Idea

The data matrix is generated as \(\mathbf{Y} = \mathbf{A}_0 \mathbf{X}_0\), where \(\mathbf{Y} \in \mathbb{R}^{n \times p}\), and the following conditions are assumed.

  1. The dictionary \(\mathbf{A}_0 \in \mathbb{R}^{n \times n}\) is complete, namely square and invertible.

  2. The entries of \(\mathbf{X}_0\) are drawn independently from the Bernoulli-Gaussian (BG) model: \(X_{0{[ij]}} = \Omega_{ij} V_{ij}\), where \(\Omega_{ij} \sim \text{Ber}(\theta)\) and \(V_{ij} \sim N(0,1)\).

The goal is to recover \(\mathbf{A}_0\) and \(\mathbf{X}_0\) from \(\mathbf{Y}\).
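
To make the generative model concrete, here is a minimal NumPy sketch of the setup (the variable names, dimensions, and the Gaussian choice of random dictionary are mine, not the paper's):

```python
# Minimal sketch of the generative model Y = A_0 X_0.
import numpy as np

rng = np.random.default_rng(0)
n, p, theta = 30, 5000, 0.1        # dimension, number of samples, sparsity level

A0 = rng.standard_normal((n, n))   # a generic square matrix is a.s. invertible, hence complete
Omega = rng.random((n, p)) < theta # Bernoulli(theta) support pattern
V = rng.standard_normal((n, p))    # i.i.d. N(0, 1) magnitudes
X0 = Omega * V                     # Bernoulli-Gaussian coefficients
Y = A0 @ X0                        # observed data
```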

The main result is that when \(\theta \in (0, 1/3)\), \(\mathbf{A}_0\) and \(\mathbf{X}_0\) can be exactly recovered by a polynomial-time algorithm with high probability, provided the number of samples \(p\) is at least a polynomial in \((n, 1/\theta, \kappa(\mathbf{A}_0), 1/\mu)\), where \(\kappa(\mathbf{A}_0)\) is the condition number of the dictionary.

Non-convex optimization formulation

Since \(\mathbf{A}_0\) is complete, the row space of \(\mathbf{Y}\) equals the row space of \(\mathbf{X}_0\). By the result from (Spielman 2013), under the generative model the rows of \(\mathbf{X}_0\) correspond to the \(n\) sparsest vectors in this row space. Once \(\mathbf{X}_0\) is recovered, \(\mathbf{A}_0\) can be obtained by least squares: \(\mathbf{A}_0 = \mathbf{Y} \mathbf{X}_0^T (\mathbf{X}_0 \mathbf{X}_0^T)^{-1}\).
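
Continuing the sketch above, once \(\mathbf{X}_0\) is in hand (and has full row rank), the least-squares step is a one-liner; solving the system directly is more stable than forming \((\mathbf{X}_0 \mathbf{X}_0^T)^{-1}\) explicitly:

```python
# Recover A0 from A0 @ X0 = Y, i.e. X0.T @ A0.T = Y.T, by least squares.
A0_hat = np.linalg.lstsq(X0.T, Y.T, rcond=None)[0].T
```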

A straightforward formulation for recovering one row of \(\mathbf{X}_0\) is \[\min \| \mathbf{q}^{T} \mathbf{Y} \|_0 \quad s.t.\ \mathbf{q} \neq \mathbf{0}.\] \(\mathbf{X}_0\) can be recovered row by row by deflation.

However, the \(\ell_0\) objective is discontinuous, and the scale of \(\mathbf{q}\) is unidentifiable. The problem above is therefore recast as

\[\min_{\mathbf{q}} f(\mathbf{q}; \hat{\mathbf{Y}}) = \frac{1}{p} \sum_{k=1}^p h_{\mu} (\mathbf{q}^T \hat{\mathbf{y}}_k), \quad s.t. \ \|\mathbf{q}\|_2 = 1,\] where \(\hat{\mathbf{Y}}\) is the data after preconditioning (described below). The loss is a smooth convex surrogate of the absolute value, namely

\[h_{\mu} (z) = \mu \log \cosh(z / \mu).\]
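
A sketch of this objective and its Euclidean gradient, directly transcribing the formulas above (the numerically stable `log cosh` rewrite is a standard trick, not from the paper):

```python
import numpy as np

def h_mu(z, mu):
    """Smooth convex surrogate of |z|: mu * log(cosh(z / mu))."""
    # log(cosh(t)) = |t| + log1p(exp(-2|t|)) - log(2), which avoids overflow for large |t|.
    t = np.abs(z / mu)
    return mu * (t + np.log1p(np.exp(-2.0 * t)) - np.log(2.0))

def f(q, Y_hat, mu):
    """Objective (1/p) * sum_k h_mu(q^T y_k)."""
    return h_mu(q @ Y_hat, mu).mean()

def grad_f(q, Y_hat, mu):
    """Euclidean gradient (1/p) * sum_k tanh(q^T y_k / mu) * y_k, since h_mu'(z) = tanh(z / mu)."""
    return Y_hat @ np.tanh(q @ Y_hat / mu) / Y_hat.shape[1]
```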

Next, we summarize some of the key properties of this problem, which are later formalized by the paper.

No spurious local minima

When \(\mathbf{A}_0\) is orthogonal, \(f(\mathbf{q}; \mathbf{Y})\) has no spurious local minima. In fact, each local minimizer \(\hat{\mathbf{q}}\) is sufficiently close to a target \(\mathbf{q}_{\ast}\), which corresponds to a row of \(\mathbf{X}_0\) via \(\mathbf{q}_{\ast}^T \mathbf{Y} = \alpha \mathbf{e}_i^T \mathbf{X}_0\) for some \(i \in [n]\) and \(\alpha \neq 0\). \(\mathbf{q}_{\ast}\) can be obtained from \(\hat{\mathbf{q}}\) by rounding.

Symmetry and reduction

  1. When \(\mathbf{A}_0\) is not orthogonal, it can be reduced to the orthogonal case by preconditioning \(\hat{\mathbf{Y}} = \big(\frac{1}{p \theta} \mathbf{Y} \mathbf{Y}^T \big)^{-1/2} \mathbf{Y}\) (see the sketch after this list).

  2. All orthogonal \(\mathbf{A}_0\) are equivalent since \(f(\mathbf{q}; \mathbf{A}_0 \mathbf{X}_0) = f(\mathbf{A_0}^T \mathbf{q}; \mathbf{X}_0)\), corresponding to a rotation on \(\mathbf{q}\). Therefore, it suffices to study \(\mathbf{A}_0 = \mathbf{I}\) only.

  3. The feasible region \(\mathbf{q} \in \mathbb{S}^{n-1}\) can be covered by \(2n\) symmetric hemispheres, centered around \(\pm \mathbf{e}_i\) for \(i=1,\cdots,n\). It suffices to study only the one centered around \(\mathbf{e}_n\).

  4. The hemisphere can be bijectively mapped to the equatorial plane \(\mathbf{e}_n^{\bot}\) with the coordinates \(\mathbf{w} \in \mathbb{B}^{n-1}\) in the unit ball, by \[\mathbf{q}(\mathbf{w}) = (\mathbf{w}, \sqrt{1-\|\mathbf{w}\|_2^2}).\]

  5. It suffices to study \(g(\mathbf{w}; \mathbf{X}_0) := f(\mathbf{q}(\mathbf{w}); \mathbf{X}_0)\) over the set \(\Gamma := \{\mathbf{w}: \|\mathbf{w}\|_2 < \sqrt{\frac{4n-1}{4n}} \}\). In this set, \(\mathbf{w} = \mathbf{0}\) corresponds to the target minimizer \(\mathbf{q} = \mathbf{e}_n\) (for the case \(\mathbf{A}_0 = \mathbf{I}\)).
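
A sketch of the preconditioning step (item 1) and the hemisphere-to-ball reparameterization (item 4); going through the eigendecomposition to get the inverse square root is my implementation choice:

```python
import numpy as np

def precondition(Y, theta):
    """Compute Y_hat = ((1 / (p * theta)) * Y @ Y.T)^(-1/2) @ Y."""
    p = Y.shape[1]
    M = Y @ Y.T / (p * theta)                   # symmetric positive definite for generic data
    evals, evecs = np.linalg.eigh(M)
    M_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return M_inv_sqrt @ Y

def q_of_w(w):
    """Map ball coordinates w in B^{n-1} to q(w) = (w, sqrt(1 - ||w||^2)) on the upper hemisphere."""
    return np.append(w, np.sqrt(1.0 - w @ w))
```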

Some direction to descend, always

The objective \(g(\mathbf{w}; \mathbf{X}_0)\) is well behaved on \(\Gamma\), in the sense that with high probability there is always a direction of descent. Specifically, \(\Gamma\) is a union of three regions \(R_{I}\), \(R_{II}\), and \(R_{III}\), ordered by increasing norm of \(\mathbf{w}\).

It holds with high probability that (Theorem 2.1)

  1. For all \(\mathbf{w} \in R_{I}\), the objective is strongly convex: \(\nabla^2 g(\mathbf{w}; \mathbf{X}_0) \succeq \frac{c \theta}{\mu} \mathbf{I}\);

  2. For all \(\mathbf{w} \in R_{II}\), the objective has a non-vanishing radial gradient: \(\frac{\mathbf{w}^T \nabla g(\mathbf{w}; \mathbf{X}_0)}{\|\mathbf{w}\|_2} \geq c \theta\);

  3. For all \(\mathbf{w} \in R_{III}\), the objective has negative curvature along \(\mathbf{w}\): \(\frac{\mathbf{w}^T \nabla^2 g(\mathbf{w}; \mathbf{X}_0)\, \mathbf{w}}{\|\mathbf{w}\|_2^2} \leq -c\theta\).

This result is shown by computing the gradient and Hessian of \(\mathbb{E}_{\mathbf{X}_0 \sim \text{BG}}\, g(\mathbf{w}; \mathbf{X}_0)\) and then applying a concentration argument.
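
As a non-rigorous numerical companion to Theorem 2.1, the two radial quantities can be probed by finite differences of \(g(\mathbf{w}) = f(\mathbf{q}(\mathbf{w}); \mathbf{X}_0)\), reusing `f` and `q_of_w` from the sketches above (the helper names and step sizes are my own; this is a sanity check, not a verification of the theorem):

```python
def g(w, X0, mu):
    """Objective in ball coordinates, for the A0 = I case where Y = X0."""
    return f(q_of_w(w), X0, mu)

def radial_grad(w, X0, mu, eps=1e-5):
    """Estimate w^T grad(g) / ||w|| as a directional derivative along u = w / ||w||."""
    u = w / np.linalg.norm(w)
    return (g(w + eps * u, X0, mu) - g(w - eps * u, X0, mu)) / (2.0 * eps)

def radial_curvature(w, X0, mu, eps=1e-4):
    """Estimate w^T Hessian(g) w / ||w||^2 via a second difference along u = w / ||w||."""
    u = w / np.linalg.norm(w)
    return (g(w + eps * u, X0, mu) - 2.0 * g(w, X0, mu) + g(w - eps * u, X0, mu)) / eps**2
```

For \(\mathbf{w}\) sampled in \(R_{II}\) the first quantity should be bounded away from zero, and for \(\mathbf{w}\) in \(R_{III}\) the second should be negative, with high probability over \(\mathbf{X}_0\).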