Speaker: Jingfeng Wu

Affiliation: Johns Hopkins University

Title: Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate

Abstract:

Understanding the algorithmic regularization effect of stochastic gradient descent (SGD) is one of the key challenges in modern machine learning and deep learning theory. Most existing works, however, focus on the very small or even infinitesimal learning rate regime and fail to cover practical scenarios where the learning rate is moderate and annealing. In this paper, we make an initial attempt to characterize the particular regularization effect of SGD in the moderate learning rate regime by studying its behavior in optimizing an overparameterized linear regression problem. In this case, SGD and GD are known to converge to the unique minimum-norm solution; however, with a moderate and annealing learning rate, we show that they exhibit different directional biases: SGD converges along the large-eigenvalue directions of the data matrix, while GD goes after the small-eigenvalue directions. Furthermore, we show that such directional bias does matter when early stopping is adopted: the SGD output is nearly optimal while the GD output is suboptimal. Finally, our theory explains several folk arts used in practice for SGD hyperparameter tuning, such as (1) linearly scaling the initial learning rate with batch size; and (2) overrunning SGD with a high learning rate even when the loss stops decreasing.
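The setting in the abstract can be made concrete with a small numpy sketch. The toy data, step sizes, and annealing schedule below are my own illustrative choices, not from the paper; the sketch only demonstrates the background fact the abstract cites, namely that in overparameterized linear regression both GD and SGD (from zero initialization) interpolate the data and GD reaches the unique minimum-norm solution. The directional-bias result itself is the paper's theoretical contribution and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparameterized linear regression: n samples, d > n features.
n, d = 20, 50
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def gd(X, y, lr=0.01, steps=20000):
    """Full-batch gradient descent on the least-squares loss, zero init."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

def sgd(X, y, lr0=0.01, steps=20000):
    """Single-sample SGD with a simple annealing (1/sqrt(t)) schedule,
    standing in for the 'moderate and annealing' regime; the exact
    schedule here is an illustrative choice."""
    w = np.zeros(X.shape[1])
    for t in range(steps):
        i = rng.integers(len(y))
        lr = lr0 / np.sqrt(1 + t)
        w -= lr * X[i] * (X[i] @ w - y[i])
    return w

# The unique minimum-norm interpolating solution.
w_min_norm = np.linalg.pinv(X) @ y

w_gd = gd(X, y)
w_sgd = sgd(X, y)

# From zero initialization GD stays in the row space of X, so it
# converges to the minimum-norm solution; SGD drives the training
# loss toward zero as well.
print("GD distance to min-norm solution:",
      np.linalg.norm(w_gd - w_min_norm))
print("SGD training loss:", np.mean((X @ w_sgd - y) ** 2))
```

Note that both methods end up near the same interpolating solution; the paper's point is that *how* they get there (which eigen-directions of the data matrix are fit first) differs, which is what makes early stopping treat them so differently.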