# Statistics > Machine Learning

# Stationary Points of Shallow Neural Networks with Quadratic Activation Function

(Submitted on 3 Dec 2019 (v1), revised 20 Feb 2020 (this version, v2), *latest version 9 Jul 2020* (v3))

Abstract: We consider the problem of learning shallow neural networks with quadratic activations and planted weight matrix $W^*\in\mathbb{R}^{m\times d}$, where $m$ is the width of the hidden layer and $d\leqslant m$ is the dimension of the data, which has centered i.i.d. coordinates with finite fourth moment. We establish that the landscape of the population risk $\mathcal{L}(W)$ admits an energy barrier separating rank-deficient $W$: if $W\in\mathbb{R}^{m\times d}$ with ${\rm rank}(W)<d$, then $\mathcal{L}(W)$ is bounded away from zero by an amount we quantify. We then establish that all full-rank stationary points of $\mathcal{L}(\cdot)$ are necessarily global optima. Together, these two results suggest a simple explanation for the success of gradient descent in training such networks when properly initialized: the gradient descent algorithm finds a global optimum because there are no spurious stationary points within the set of full-rank matrices.
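The setting can be made concrete with a small numerical sketch. It assumes details the abstract does not state: the output weights are all one, so the network computes $\sum_j \langle w_j, x\rangle^2$, the loss is the squared error, and the data is standard Gaussian. The names `quadratic_net` and `monte_carlo_risk` are hypothetical, and the risk is a Monte Carlo estimate rather than the exact population risk.

```python
import numpy as np

def quadratic_net(W, X):
    """Network output for each row x of X: sum_j <w_j, x>^2 (unit output weights assumed)."""
    return np.sum((X @ W.T) ** 2, axis=1)

def monte_carlo_risk(W, W_star, X):
    """Monte Carlo estimate of the population risk under an assumed squared loss."""
    return np.mean((quadratic_net(W, X) - quadratic_net(W_star, X)) ** 2)

rng = np.random.default_rng(0)
m, d, n = 6, 3, 2000                      # width, data dimension, sample count (illustrative)
W_star = rng.standard_normal((m, d))      # planted weights
X = rng.standard_normal((n, d))           # centered i.i.d. coordinates (Gaussian is an assumption)

# The output is a quadratic form in x: f(W; x) = x^T (W^T W) x.
quad_form = np.einsum("ni,ij,nj->n", X, W_star.T @ W_star, X)

# Any W with W^T W = W*^T W* has zero risk, for example an orthogonal rotation of W*.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
W_rot = Q @ W_star

# A rank-deficient candidate: zero out one column of W*.
W_def = W_star.copy()
W_def[:, -1] = 0.0

print("risk of rotated planted weights:", monte_carlo_risk(W_rot, W_star, X))
print("risk of rank-deficient weights: ", monte_carlo_risk(W_def, W_star, X))
```

Under these assumptions the risk depends on $W$ only through $W^T W$, which is why global optima are not unique, while the rank-deficient candidate has strictly positive estimated risk.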

We then show that if $W^*\in\mathbb{R}^{m\times d}$ has centered i.i.d. entries with unit variance and finite fourth moment, and the network is sufficiently wide, that is, $m>Cd^2$ for a sufficiently large constant $C$, then it is easy to construct a full-rank matrix $W$ with population risk below the energy barrier, starting from which gradient descent is guaranteed to converge to a global optimum.

Our final focus is on sample complexity: we identify a simple necessary and sufficient geometric condition on the training data, not retrospective in nature, under which any minimizer of the empirical loss necessarily has zero generalization error. We show that as soon as $n\geqslant n^*=d(d+1)/2$, random data satisfies this geometric condition almost surely. At the same time, we show that if $n<n^*$ and the data has centered i.i.d. coordinates, then there always exists a matrix $W$ with zero empirical risk but with population risk bounded away from zero by the same amount as for rank-deficient matrices.
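The threshold $n^*=d(d+1)/2$ is the dimension of the space of symmetric $d\times d$ matrices. The abstract does not spell out the geometric condition. One natural candidate, assumed here, is that the matrices $x_i x_i^T$ built from the training points span that space. The sketch below checks this span condition numerically; the function names are hypothetical and the Gaussian data is only an illustrative choice.

```python
import numpy as np

def sym_features(X):
    """Map each row x of X to the upper-triangular entries of x x^T."""
    n, d = X.shape
    rows, cols = np.triu_indices(d)
    outer = X[:, :, None] * X[:, None, :]      # shape (n, d, d)
    return outer[:, rows, cols]                # shape (n, d(d+1)/2)

def spans_symmetric_matrices(X):
    """True if the matrices x_i x_i^T span all symmetric d x d matrices (assumed condition)."""
    d = X.shape[1]
    n_star = d * (d + 1) // 2
    return bool(np.linalg.matrix_rank(sym_features(X)) == n_star)

rng = np.random.default_rng(0)
d = 4
n_star = d * (d + 1) // 2                      # equals 10 for d = 4

X_enough = rng.standard_normal((n_star, d))    # n = n*
X_short = rng.standard_normal((n_star - 1, d)) # n < n*

print(spans_symmetric_matrices(X_enough))
print(spans_symmetric_matrices(X_short))
```

With fewer than $n^*$ samples the feature matrix has at most $n<n^*$ rows, so the span condition cannot hold regardless of how the data is drawn.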

## Submission history

From: Eren Can Kızıldağ

**[v1]** Tue, 3 Dec 2019 18:52:37 GMT (29kb)

**[v2]** Thu, 20 Feb 2020 16:21:23 GMT (39kb)

**[v3]** Thu, 9 Jul 2020 22:02:14 GMT (87kb)
