Title: Stationary Points of Shallow Neural Networks with Quadratic Activation Function

Abstract: We consider the problem of learning shallow neural networks with quadratic activations and planted weight matrix $W^*\in\mathbb{R}^{m\times d}$, where $m$ is the width of the hidden layer and $d\leqslant m$ is the dimension of the data, which has centered i.i.d. coordinates with finite fourth moment. We establish that the landscape of the population risk $\mathcal{L}(W)$ admits an energy barrier separating rank-deficient $W$: if $W\in\mathbb{R}^{m\times d}$ with ${\rm rank}(W)<d$, then $\mathcal{L}(W)$ is bounded away from zero by an amount we quantify. We then establish that all full-rank stationary points of $\mathcal{L}(\cdot)$ are necessarily global optima. These two results suggest a simple explanation for the success of gradient descent in training such networks when properly initialized: gradient descent finds a global optimum due to the absence of spurious stationary points within the set of full-rank matrices.
We then show that if $W^*\in\mathbb{R}^{m\times d}$ has centered i.i.d. entries with unit variance and finite fourth moment, and the network is sufficiently wide, that is, $m>Cd^2$ for a large constant $C$, then it is easy to construct a full-rank matrix $W$ with population risk below the energy barrier, starting from which gradient descent is guaranteed to converge to a global optimum.
Our final focus is on sample complexity: we identify a simple necessary and sufficient geometric condition, not retrospective in manner, on the training data under which any minimizer of the empirical loss necessarily has zero generalization error. We show that as soon as $n\geqslant n^*=d(d+1)/2$, random data satisfies this geometric condition almost surely. At the same time, we show that if $n<n^*$ and the data has centered i.i.d. coordinates, there always exists a matrix $W$ with zero empirical risk but with population risk bounded away from zero by the same amount as for rank-deficient matrices.
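The threshold $n^*=d(d+1)/2$ is the dimension of the space of symmetric $d\times d$ matrices. The abstract does not spell out the geometric condition, so the following is a minimal sketch under an assumption: the network output is taken to be $f(W;x)=\sum_j (w_j^\top x)^2 = \langle W^\top W, xx^\top\rangle$, and the condition is read as "the matrices $x_i x_i^\top$ span all symmetric matrices". Under that reading, zero empirical risk forces $W^\top W = W^{*\top} W^*$, and hence zero generalization error. The function names (`n_star`, `sym_features`, `numerical_rank`, `spans_symmetric_matrices`) are hypothetical and introduced only for illustration.

```python
import random


def n_star(d):
    """Dimension of the space of symmetric d x d matrices: d(d+1)/2."""
    return d * (d + 1) // 2


def sym_features(x):
    """Upper-triangular entries of x x^T, a vector of length d(d+1)/2."""
    d = len(x)
    return [x[i] * x[j] for i in range(d) for j in range(i, d)]


def numerical_rank(rows, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting (illustrative only)."""
    a = [list(map(float, r)) for r in rows]
    ncols = len(a[0]) if a else 0
    r = 0
    for c in range(ncols):
        if r == len(a):
            break
        # Pick the remaining row with the largest entry in column c.
        p = max(range(r, len(a)), key=lambda i: abs(a[i][c]))
        if abs(a[p][c]) < tol:
            continue
        a[r], a[p] = a[p], a[r]
        for i in range(r + 1, len(a)):
            f = a[i][c] / a[r][c]
            for k in range(c, ncols):
                a[i][k] -= f * a[r][k]
        r += 1
    return r


def spans_symmetric_matrices(xs):
    """Assumed condition: the x_i x_i^T span all symmetric d x d matrices."""
    d = len(xs[0])
    return numerical_rank([sym_features(x) for x in xs]) == n_star(d)


if __name__ == "__main__":
    random.seed(0)
    d = 4
    for n in (n_star(d) - 1, n_star(d)):
        xs = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
        print(n, spans_symmetric_matrices(xs))
```

With fewer than `n_star(d)` samples the check must fail, since at most $n$ matrices cannot span a space of dimension $d(d+1)/2 > n$. With $n\geqslant n^*$ continuous random samples, it is expected to pass up to the numerical tolerance `tol`.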
Comments: 30 pages
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Probability (math.PR); Statistics Theory (math.ST)
Cite as: arXiv:1912.01599 [stat.ML]
  (or arXiv:1912.01599v2 [stat.ML] for this version)

Submission history

From: Eren Can Kızıldağ
[v1] Tue, 3 Dec 2019 18:52:37 GMT (29kb)
[v2] Thu, 20 Feb 2020 16:21:23 GMT (39kb)
[v3] Thu, 9 Jul 2020 22:02:14 GMT (87kb)