BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//128.220.13.64//NONSGML kigkonsult.se iCalcreator 2.20//
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-WR-CALNAME:Johns Hopkins Algorithms and Complexity
X-WR-CALDESC:
X-FROM-URL:https://www.cs.jhu.edu/~mdinitz/theory
X-WR-TIMEZONE:America/New_York
BEGIN:VTIMEZONE
TZID:America/New_York
X-LIC-LOCATION:America/New_York
BEGIN:STANDARD
DTSTART:20221106T020000
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
TZNAME:EST
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20230312T020000
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
TZNAME:EDT
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:ai1ec-338@www.cs.jhu.edu/~mdinitz/theory
DTSTAMP:20221126T124332Z
CATEGORIES:
CONTACT:
DESCRIPTION:Speaker: Jingfeng Wu\nAffiliation: Johns Hopkins University\nTi
tle: Direction Matters: On the Implicit Regularization Effect of Stochasti
c Gradient Descent with Moderate Learning Rate\nAbstract:\nUnderstanding t
he algorithmic regularization effect of stochastic gradient descent (SGD)
is one of the key challenges in modern machine learning and deep learning
theory. Most existing works\, however\, focus on the very small or even in
finitesimal learning rate regime\, and fail to cover practical scenarios w
here the learning rate is moderate and annealing. In this paper\, we make
an initial attempt to characterize the particular regularization effect of
SGD in the moderate learning rate regime by studying its behavior when op
timizing an overparameterized linear regression problem. In this case\, SG
D and GD are known to converge to the unique minimum-norm solution\; howev
er\, with a moderate and annealing learning rate\, we show that they exhib
it different directional biases: SGD converges along the large-eigenvalue
directions of the data matrix\, while GD goes after the small-eigenvalue d
irections. Furthermore\, we show that such directional bias does matter wh
en early stopping is adopted\, where the SGD output is nearly optimal but
the GD output is suboptimal. Finally\, our theory explains several folk ar
ts used in practice for SGD hyperparameter tuning\, such as (1) linearly s
caling the initial learning rate with the batch size\; and (2) overrunning
SGD with a high learning rate even when the loss stops decreasing.
DTSTART;TZID=America/New_York:20201104T120000
DTEND;TZID=America/New_York:20201104T130000
SEQUENCE:0
SUMMARY:[Theory Seminar] Jingfeng Wu
URL:https://www.cs.jhu.edu/~mdinitz/theory/event/theory-seminar-jingfeng-wu
/
X-COST-TYPE:free
END:VEVENT
END:VCALENDAR