Meeting of the Minds 2022
Symposium by ForagerOne

Toward Understanding Why Adam Converges Faster Than SGD for Transformers


Presenter(s)

Yan Pan

Programs/Groups

07-400 Research Practicum

Abstract or Description

While stochastic gradient descent (SGD) remains the most popular optimization algorithm in deep learning, adaptive algorithms such as Adam have established an empirical advantage over SGD in some deep learning applications, such as training transformers. However, it remains an open question why Adam converges significantly faster than SGD in these scenarios. To understand this, we study whether there exist function classes common in machine learning on which Adam provably converges faster than SGD. We argue that the performance of Adam is related to the distribution of smoothness over the coordinates. We propose a new notion of smoothness for functions, called "robust smoothness," and conjecture that adaptive algorithms can achieve faster convergence with a large learning rate that depends only on the robust smoothness, provided an error term is allowed. Finally, we empirically demonstrate that robust smoothness better captures the average smoothness of transformers trained on language tasks.
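To make the comparison concrete, the following is a minimal sketch (not the paper's method) of the two update rules the abstract contrasts. The key structural difference is that SGD scales every coordinate by the same learning rate, while Adam normalizes each coordinate by a running estimate of its squared gradient, so step sizes adapt per coordinate. All names and hyperparameter values here are standard textbook defaults, not values from this work:

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    # SGD: one global learning rate, identical scaling for every coordinate.
    return w - lr * grad

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: per-coordinate step. Each coordinate is normalized by a
    # running estimate of its second moment, so coordinates with
    # persistently small gradients still take non-negligible steps.
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

On the first step, the bias-corrected update reduces to roughly `lr * sign(grad)` per coordinate, illustrating why Adam's effective step size is insensitive to the raw gradient scale of each coordinate, which is the kind of coordinate-wise behavior the smoothness-distribution argument above concerns.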

Mentor

Yuanzhi Li
