ROUGH DRAFT authorea.com/91175

# Main Idea

## Objective

The data matrix is generated as $$\mathbf{Y} = \mathbf{A}_0 \mathbf{X}_0$$, where $$\mathbf{Y} \in \mathbb{R}^{n \times p}$$, and the following conditions are assumed.

1. The dictionary $$\mathbf{A}_0 \in \mathbb{R}^{n \times n}$$ is complete, namely square and invertible.

2. The elements of $$\mathbf{X}_0$$ are drawn independently from the Bernoulli-Gaussian (BG) model: $$X_{0[ij]} = \Omega_{ij} V_{ij}$$, where $$\Omega_{ij} \sim \text{Ber}(\theta)$$ and $$V_{ij} \sim N(0,1)$$.

The goal is to recover $$\mathbf{A}_0$$ and $$\mathbf{X}_0$$ from $$\mathbf{Y}$$.
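As a concrete illustration, the generative model can be simulated as follows (the sizes $$n$$, $$p$$ and the sparsity level $$\theta$$ below are hypothetical choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, theta = 20, 5000, 0.2          # hypothetical sizes and sparsity level

# Complete dictionary: square and invertible (here, a random orthogonal matrix).
A0, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Bernoulli-Gaussian coefficients: X0[i, j] = Omega[i, j] * V[i, j],
# with Omega[i, j] ~ Ber(theta) and V[i, j] ~ N(0, 1).
Omega = rng.random((n, p)) < theta
V = rng.standard_normal((n, p))
X0 = Omega * V

Y = A0 @ X0                          # observed data matrix
```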

The main result is that when $$\theta \in (0, 1/3)$$, $$\mathbf{A}_0$$ and $$\mathbf{X}_0$$ can be exactly recovered by a polynomial-time algorithm with high probability, provided the number of samples $$p$$ is at least a polynomial in $$(n, 1/\theta, \kappa(\mathbf{A}_0), 1/\mu)$$, where $$\kappa(\mathbf{A}_0)$$ is the condition number of the dictionary.

## Non-convex optimization formulation

Since $$\mathbf{A}_0$$ is complete, the row space of $$\mathbf{Y}$$ is equal to the row space of $$\mathbf{X}_0$$. By the result of (Spielman 2013), under the generative model, the rows of $$\mathbf{X}_0$$ correspond to the $$n$$ sparsest vectors in the row space. Once $$\mathbf{X}_0$$ is recovered, $$\mathbf{A}_0$$ can be obtained by least squares: $$\mathbf{A}_0 = \mathbf{Y} \mathbf{X}_0^T (\mathbf{X}_0 \mathbf{X}_0^T)^{-1}$$.
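A minimal sketch of this least-squares step (assuming $$\mathbf{X}_0 \mathbf{X}_0^T$$ is invertible, which holds with high probability for large $$p$$):

```python
import numpy as np

def recover_dictionary(Y, X0):
    """Solve Y = A0 @ X0 for A0 by least squares: A0 = Y X0^T (X0 X0^T)^{-1}."""
    return Y @ X0.T @ np.linalg.inv(X0 @ X0.T)
```

In exact arithmetic this returns $$\mathbf{A}_0$$ exactly, since $$\mathbf{Y}\mathbf{X}_0^T(\mathbf{X}_0\mathbf{X}_0^T)^{-1} = \mathbf{A}_0\mathbf{X}_0\mathbf{X}_0^T(\mathbf{X}_0\mathbf{X}_0^T)^{-1} = \mathbf{A}_0$$.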

A straightforward formulation for recovering one row of $$\mathbf{X}_0$$ is $\min \| \mathbf{q}^{T} \mathbf{Y} \|_0 \quad s.t.\ \mathbf{q} \neq \mathbf{0}.$ $$\mathbf{X}_0$$ can be recovered row by row by deflation.

However, the $$\ell_0$$ objective is discontinuous, and the scale of $$\mathbf{q}$$ is unidentifiable. The problem above is recast as

$\min f(\mathbf{q}; \hat{\mathbf{Y}}) = \frac{1}{p} \sum_{k=1}^p h_{\mu} (\mathbf{q}^T \hat{\mathbf{y}}_k), \quad s.t. \ \|\mathbf{q}\|_2 = 1,$ where $$\hat{\mathbf{Y}}$$ is the data after preconditioning. The loss function is a smooth convex surrogate for the absolute value, namely

$h_{\mu} (z) = \mu \log \cosh(z / \mu).$
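The smoothed objective can be sketched as below, using the numerically stable identity $$\log\cosh x = |x| + \log(1 + e^{-2|x|}) - \log 2$$ to avoid overflow in $$\cosh$$:

```python
import numpy as np

def h_mu(z, mu):
    """h_mu(z) = mu * log(cosh(z / mu)), a smooth surrogate for |z|.
    Uses log cosh(x) = |x| + log1p(exp(-2|x|)) - log 2 for stability."""
    x = np.abs(z) / mu
    return mu * (x + np.log1p(np.exp(-2.0 * x)) - np.log(2.0))

def f(q, Y_hat, mu):
    """f(q; Y_hat) = (1/p) * sum_k h_mu(q^T y_k)."""
    return np.mean(h_mu(q @ Y_hat, mu))
```

Note that $$h_\mu(0) = 0$$ and $$h_\mu(z) \approx |z| - \mu \log 2$$ for $$|z| \gg \mu$$, so minimizing $$f$$ promotes sparsity of $$\mathbf{q}^T \hat{\mathbf{Y}}$$.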

Next, we summarize some of the key properties of this problem, which are later formalized by the paper.

#### No spurious local minima

When $$\mathbf{A}_0$$ is orthogonal, $$f(\mathbf{q}; \mathbf{Y})$$ has no spurious local minima. In fact, each local minimizer $$\hat{\mathbf{q}}$$ is sufficiently close to a target $$\mathbf{q}_{\ast}$$, which corresponds to a row of $$\mathbf{X}_0$$ via $$\mathbf{q}_{\ast}^T \mathbf{Y} = \alpha \mathbf{e}_i^T \mathbf{X}_0$$ for some $$i \in [n]$$ and $$\alpha \neq 0$$. $$\mathbf{q}_{\ast}$$ can be obtained from $$\hat{\mathbf{q}}$$ by rounding.

#### Symmetry and reduction

1. When $$\mathbf{A}_0$$ is not orthogonal, it can be reduced to the orthogonal case by preconditioning: $$\hat{\mathbf{Y}} = \big(\frac{1}{p \theta} \mathbf{Y} \mathbf{Y}^T \big)^{-1/2} \mathbf{Y}$$.

2. All orthogonal $$\mathbf{A}_0$$ are equivalent since $$f(\mathbf{q}; \mathbf{A}_0 \mathbf{X}_0) = f(\mathbf{A_0}^T \mathbf{q}; \mathbf{X}_0)$$, corresponding to a rotation on $$\mathbf{q}$$. Therefore, it suffices to study $$\mathbf{A}_0 = \mathbf{I}$$ only.

3. The feasible region $$\mathbf{q} \in \mathbb{S}^{n-1}$$ can be covered by $$2n$$ symmetric hemispheres, centered around $$\pm \mathbf{e}_i$$ for $$i=1,\cdots,n$$. By symmetry, it suffices to study only the one centered around $$\mathbf{e}_n$$.

4. The hemisphere can be bijectively mapped to the equatorial plane $$\mathbf{e}_n^{\bot}$$ with the coordinates $$\mathbf{w} \in \mathbb{B}^{n-1}$$ in the unit ball, by $\mathbf{q}(\mathbf{w}) = (\mathbf{w}, \sqrt{1-\|\mathbf{w}\|_2^2}).$

5. It suffices to study $$g(\mathbf{w}; \mathbf{X}_0) := f(\mathbf{q}(\mathbf{w}); \mathbf{X}_0)$$ over the set $$\Gamma := \{\mathbf{w}: \|\mathbf{w}\|_2 < \sqrt{\frac{4n-1}{4n}} \}$$. In this set, $$\mathbf{w} = \mathbf{0}$$ corresponds to the target minimizer $$\mathbf{q} = \mathbf{e}_n$$ (recall $$\mathbf{A}_0 = \mathbf{I}$$).
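The preconditioning and the ball-to-hemisphere parametrization above can be sketched as follows. Note that $$\frac{1}{p\theta}\hat{\mathbf{Y}}\hat{\mathbf{Y}}^T = \mathbf{I}$$ holds exactly by construction, which is the sense in which the data is "orthogonalized":

```python
import numpy as np

def precondition(Y, theta):
    """Y_hat = ((1/(p*theta)) Y Y^T)^{-1/2} Y, via an eigendecomposition."""
    p = Y.shape[1]
    M = (Y @ Y.T) / (p * theta)      # symmetric positive definite for generic Y
    vals, vecs = np.linalg.eigh(M)
    M_inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T
    return M_inv_sqrt @ Y

def q_of_w(w):
    """Map ball coordinates w in B^{n-1} to the hemisphere around e_n."""
    return np.append(w, np.sqrt(1.0 - w @ w))
```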

#### Some direction to descend, always

The objective $$g(\mathbf{w}; \mathbf{X}_0)$$ is well behaved in $$\Gamma$$, in the sense that with high probability there is always some direction of descent. It is shown that $$\Gamma$$ is a union of three regions $$R_{I}$$, $$R_{II}$$ and $$R_{III}$$, ordered by increasing norm of $$\mathbf{w}$$.

It holds with high probability that (Theorem 2.1)

1. For all $$\mathbf{w} \in R_{I}$$, the objective is strongly convex: $$\nabla^2 g(\mathbf{w}; \mathbf{X}_0) \succeq \frac{c \theta}{\mu} \mathbf{I}$$;

2. For all $$\mathbf{w} \in R_{II}$$, the objective has a non-zero gradient: $$\frac{\mathbf{w}^T \nabla g(\mathbf{w}; \mathbf{X}_0)}{\|\mathbf{w}\|_2} \geq c \theta$$;

3. For all $$\mathbf{w} \in R_{III}$$, the objective has negative curvature: $$\frac{\mathbf{w}^T \nabla^2 g(\mathbf{w}; \mathbf{X}_0) \mathbf{w}}{\|\mathbf{w}\|_2^2} \leq -c\theta$$.

This result is shown by computing the gradient and Hessian of $$\mathbb{E}_{\mathbf{X}_0 \sim \text{BG}} g(\mathbf{w}; \mathbf{X}_0)$$ and then a concentration argument.
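As a rough numerical illustration (not a substitute for the concentration argument), one can evaluate the empirical $$g$$ on a random BG instance with $$\mathbf{A}_0 = \mathbf{I}$$ and probe it with finite differences along a radial direction: the radial gradient should be positive at a moderate $$\|\mathbf{w}\|_2$$ and the radial curvature negative near the boundary of $$\Gamma$$. The radii and parameters below are ad-hoc choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, theta, mu = 10, 50000, 0.2, 0.05   # ad-hoc illustration parameters

# A0 = I, so Y = X0 and g(w) = f(q(w); X0).
X0 = (rng.random((n, p)) < theta) * rng.standard_normal((n, p))

def g(w):
    q = np.append(w, np.sqrt(1.0 - w @ w))           # q(w) on the hemisphere
    x = np.abs(q @ X0) / mu                          # stable mu*log cosh(z/mu)
    return np.mean(mu * (x + np.log1p(np.exp(-2 * x)) - np.log(2)))

u = np.ones(n - 1) / np.sqrt(n - 1)                  # a fixed radial direction

def radial(r, t=1e-4):
    """Central finite differences of r -> g(r * u): first and second derivative."""
    g0, gp, gm = g(r * u), g((r + t) * u), g((r - t) * u)
    return (gp - gm) / (2 * t), (gp - 2 * g0 + gm) / t**2

slope_mid, _ = radial(0.4)      # gradient at a moderate radius
_, curv_out = radial(0.94)      # radial curvature near the boundary of Gamma
```

On this instance `slope_mid` comes out positive and `curv_out` negative, consistent with the gradient and negative-curvature regions of the theorem.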