On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

Şimşekli, Umut; Gürbüzbalaban, Mert; Nguyen, Thanh Huy; Richard, Gaël; Sagun, Levent

(Submitted on 29 Nov 2019)

Abstract: The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the \emph{classical} central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the \emph{generalized} CLT, which suggests that the GN converges to a \emph{heavy-tailed} $\alpha$-stable random vector, where the \emph{tail-index} $\alpha$ determines the heavy-tailedness of the distribution. Accordingly, we propose to analyze SGD as a discretization of an SDE driven by a L\'{e}vy motion. Such SDEs can incur `jumps', which force the SDE and its discretization to \emph{transition} from narrow minima to wider minima, as proven by existing metastability theory and the extensions that we proved recently. In this study, under the $\alpha$-stable GN assumption, we further establish an explicit connection between the convergence rate of SGD to a local minimum and the tail-index $\alpha$. To validate the $\alpha$-stable assumption, we conduct experiments on common deep learning scenarios and show that in all settings, the GN is highly non-Gaussian and admits heavy tails. We investigate the tail behavior in varying network architectures and sizes, loss functions, and datasets. Our results open up a different perspective and shed more light on the belief that SGD prefers wide minima.
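
The two ingredients named in the abstract, $\alpha$-stable noise and SGD viewed as a discretization of a L\'{e}vy-driven SDE, can be illustrated with a minimal sketch. This is not the authors' code or experimental setup. It assumes a scalar parameter, symmetric $\alpha$-stable noise with unit scale drawn with the Chambers-Mallows-Stuck method, and a plain Euler scheme in which the noise increment over a step of size eta is scaled by eta**(1/alpha), which follows from the self-similarity of L\'{e}vy motion. The names sample_sas, levy_sgd_path, and tail_fraction are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def sample_sas(alpha, rng):
    """Draw one standard symmetric alpha-stable sample (Chambers-Mallows-Stuck).

    The characteristic function is exp(-|t|**alpha), so alpha = 2 gives a
    Gaussian with variance 2 and alpha = 1 gives a standard Cauchy.
    """
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must lie in (0, 2]")
    u = rng.uniform(-math.pi / 2, math.pi / 2)  # random angle
    w = rng.expovariate(1.0)                    # unit-mean exponential
    if alpha == 1.0:
        return math.tan(u)                      # Cauchy special case
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def levy_sgd_path(grad, theta0, alpha, eta, sigma, n_steps, rng):
    """Euler discretization of d(theta) = -grad(theta) dt + sigma dL^alpha.

    Illustrative assumption: scalar theta and step size eta; each noise
    increment is eta**(1/alpha) * sigma * S with S symmetric alpha-stable.
    """
    path = [theta0]
    for _ in range(n_steps):
        theta = path[-1]
        jump = eta ** (1.0 / alpha) * sigma * sample_sas(alpha, rng)
        path.append(theta - eta * grad(theta) + jump)
    return path


def tail_fraction(samples, threshold):
    """Fraction of samples whose magnitude exceeds the threshold."""
    return sum(abs(x) > threshold for x in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    gauss = [sample_sas(2.0, rng) for _ in range(100_000)]
    heavy = [sample_sas(1.2, rng) for _ in range(100_000)]
    # Heavy tails: large deviations are far more frequent when alpha < 2.
    print("P(|x| > 10), alpha = 2.0:", tail_fraction(gauss, 10.0))
    print("P(|x| > 10), alpha = 1.2:", tail_fraction(heavy, 10.0))
    # Noisy descent on the quadratic f(theta) = theta**2 / 2.
    path = levy_sgd_path(lambda t: t, 1.0, 1.2, 0.01, 0.1, 1000, rng)
    print("largest single-step move:",
          max(abs(b - a) for a, b in zip(path, path[1:])))
```

With $\alpha = 2$ the sketch reduces to Gaussian noise, so comparing the two tail fractions gives a rough feel for what "heavy-tailed" means here. The occasional large single-step move in the simulated path is the kind of `jump' the abstract refers to.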

Comments:	32 pages. arXiv admin note: substantial text overlap with arXiv:1901.06053
Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG); Classical Analysis and ODEs (math.CA)
Cite as:	arXiv:1912.00018 [stat.ML]
	(or arXiv:1912.00018v1 [stat.ML] for this version)

Submission history

From: Umut Şimşekli
[v1] Fri, 29 Nov 2019 16:56:02 GMT (1822kb,D)