Toward Understanding Why Adam Converges Faster Than SGD for Transformers
Yan Pan
07-400 Research Practicum
While stochastic gradient descent (SGD) remains the most popular optimization algorithm in deep learning, adaptive algorithms such as Adam have established an empirical advantage over SGD in some deep learning applications, such as training transformers. However, it remains an open question why Adam converges significantly faster than SGD in these scenarios. To understand this, we study whether there exist function classes common in machine learning for which Adam provably converges faster than SGD. We argue that the performance of Adam is related to the distribution of smoothness over the coordinates. We propose a new notion of smoothness for functions, called "robust smoothness," and conjecture that adaptive algorithms can achieve faster convergence with a large learning rate that depends only on the robust smoothness, provided an error term is allowed. Finally, we empirically demonstrate that robust smoothness better captures the average smoothness of transformers trained on language tasks.
Yuanzhi Li
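As background for the per-coordinate view referred to in the abstract, the following sketch (not part of the abstract; it uses the standard textbook update rules, with $g_t$ the stochastic gradient, $\eta$ the learning rate, and $\beta_1, \beta_2, \epsilon$ Adam's usual hyperparameters) contrasts SGD, which shares one step size across all coordinates, with Adam, which rescales each coordinate by a running estimate of its gradient magnitude:

\begin{align*}
\text{SGD:}\quad & w_{t+1} = w_t - \eta\, g_t, \\
\text{Adam:}\quad & m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \\
& w_{t+1} = w_t - \eta\, \frac{m_t / (1-\beta_1^{t})}{\sqrt{v_t / (1-\beta_2^{t})} + \epsilon} \quad \text{(applied coordinate-wise).}
\end{align*}

Because the effective step size $\eta / (\sqrt{v_{t,i}/(1-\beta_2^{t})} + \epsilon)$ differs across coordinates $i$, Adam's behavior can depend on how smoothness is distributed over the coordinates, which is the quantity the abstract's notion of robust smoothness is meant to capture.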