Title
Edge of Stochastic Stability: SGD does not train neural networks as you expect it
Speaker
Pierfrancesco Beneventano - Massachusetts Institute of Technology (MIT)
Abstract
Recent findings demonstrate that when training neural networks with full-batch (deterministic) gradient descent with step size η, the largest eigenvalue λ of the Hessian consistently stabilizes around 2/η. These results are surprising and carry significant implications for convergence and generalization. This, however, is not the case for mini-batch optimization algorithms, which limits how broadly these findings and their consequences apply. We show that mini-batch Stochastic Gradient Descent (SGD) trains in a different regime, which we term the Edge of Stochastic Stability (EoSS). In this regime, the quantity that stabilizes at 2/η is Batch Sharpness: the expected directional curvature of the mini-batch Hessians along their corresponding stochastic gradients. As a consequence, λ, which is generally smaller than Batch Sharpness, is suppressed. This aligns with the long-standing empirical observation that smaller batches and larger step sizes favor flatter minima. We further discuss implications for the mathematical modeling of SGD trajectories.
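For readers who want a concrete handle on the central quantity, one plausible formalization of Batch Sharpness, written out from the verbal definition above, is the following; the notation (L_B, θ, η) is ours and is only a sketch of what the talk will make precise.

\[
% Expected directional curvature of the mini-batch Hessian along the mini-batch gradient
\mathrm{BatchSharpness}(\theta) \;=\; \mathbb{E}_{B}\!\left[ \frac{\nabla L_B(\theta)^{\top}\, \nabla^{2} L_B(\theta)\, \nabla L_B(\theta)}{\lVert \nabla L_B(\theta) \rVert^{2}} \right],
\]

where \(L_B\) is the loss on a mini-batch \(B\), \(\nabla L_B\) its stochastic gradient, and \(\nabla^{2} L_B\) its Hessian. Under this reading, the full-batch Edge of Stability corresponds to \(\lambda = \lambda_{\max}\!\left(\nabla^{2} L(\theta)\right) \approx 2/\eta\), whereas at the Edge of Stochastic Stability it is \(\mathrm{BatchSharpness}(\theta)\) that hovers around \(2/\eta\), with \(\lambda\) typically staying below it.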
Bio
Pierfrancesco Beneventano is a postdoctoral researcher at MIT (25th Anniversary McGovern Fellow), mentored by Tomaso Poggio. He works in machine learning theory, focusing on the mathematical foundations of deep learning and optimization. His research investigates where stochastic gradient methods converge on nonconvex landscapes, how hyperparameters steer the solutions found in practice, and how training instabilities relate to performance. He received his PhD in Operations Research and Financial Engineering from Princeton University, where he was advised by Boris Hanin and Jason D. Lee.
When
Thursday, January 15th, 14:30
Where
Room 322, UniGe DIBRIS/DIMA, Via Dodecaneso 35