Toward Understanding Why Adam Converges Faster Than SGD for Transformers
Yan Pan
07-400 Research Practicum
While stochastic gradient descent (SGD) remains the most popular optimization algorithm in deep learning, adaptive algorithms such as Adam have established an empirical advantage over SGD in some deep learning applications, such as training transformers. However, it remains an open question why Adam converges significantly faster than SGD in these scenarios. To understand this, we study whether there exist function classes common in machine learning for which Adam provably converges faster than SGD. We argue that the performance of Adam is related to the distribution of smoothness over the coordinates. We propose a new notion of smoothness for functions, called "robust smoothness," and conjecture that adaptive algorithms can achieve faster convergence with a large learning rate that depends only on the robust smoothness, provided an error term is allowed. Finally, we empirically demonstrate that robust smoothness better captures the average smoothness of transformers trained on language tasks.
Yuanzhi Li
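As background for the per-coordinate view referred to in the abstract, the following sketch (not part of the abstract; it uses the standard textbook update rules, with $g_t$ the stochastic gradient, $\eta$ the learning rate, and $\beta_1, \beta_2, \epsilon$ Adam's usual hyperparameters) contrasts SGD, which shares one step size across all coordinates, with Adam, which rescales each coordinate by a running estimate of its gradient magnitude:

\begin{align*}
\text{SGD:}\quad & w_{t+1} = w_t - \eta\, g_t, \\
\text{Adam:}\quad & m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \\
& w_{t+1} = w_t - \eta\, \frac{m_t / (1-\beta_1^{t})}{\sqrt{v_t / (1-\beta_2^{t})} + \epsilon} \quad \text{(applied coordinate-wise).}
\end{align*}

Because the effective step size $\eta / (\sqrt{v_{t,i}/(1-\beta_2^{t})} + \epsilon)$ differs across coordinates $i$, Adam's behavior can depend on how smoothness is distributed over the coordinates, which is the quantity the abstract's notion of robust smoothness is meant to capture.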