Exploring the gradient noise scale
A demonstration of critical batch size, and a tribute to open-access machine learning.
This visualization takes deep inspiration from Gabriel Goh’s “Why Momentum Really Works”.
It is my tribute to Distill, a journal that profoundly shaped my perspective as a young aspiring researcher in the late 2010s, but which is now on indefinite hiatus.
[Interactive demo: sliders control the learning rate (0.001–1.0, shown at 0.0050) and the batch size (shown at B = 2048).]
To explore the nuances of the relationship between critical batch size and learning rate, I recommend reading the landmark paper by McCandlish et al., “An Empirical Model of Large-Batch Training”.
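For readers curious how the noise scale is actually measured, that paper defines the simple noise scale as B_simple = tr(Σ) / |G|², the trace of the per-example gradient covariance divided by the squared norm of the true gradient, and its Appendix A gives unbiased estimators built from gradient norms at two batch sizes. Below is a minimal sketch of that estimator; the toy isotropic-noise gradient model and all parameter values are assumptions made purely for the demonstration, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumption): each per-example gradient is the true gradient G
# plus isotropic Gaussian noise, so tr(Sigma) = sigma2 * d by construction.
d, sigma2 = 1_000, 4.0
G = rng.normal(size=d) / np.sqrt(d)  # true gradient, |G|^2 on the order of 1

def mean_sq_norm(B, trials=200):
    """Average squared norm of the batch-mean gradient at batch size B."""
    total = 0.0
    for _ in range(trials):
        # The mean of B i.i.d. N(0, sigma2) noise vectors is N(0, sigma2 / B).
        eps_bar = rng.normal(scale=np.sqrt(sigma2 / B), size=d)
        total += np.sum((G + eps_bar) ** 2)
    return total / trials

B_small, B_big = 32, 1024
g2_small, g2_big = mean_sq_norm(B_small), mean_sq_norm(B_big)

# Unbiased estimators from McCandlish et al., Appendix A:
G2_est = (B_big * g2_big - B_small * g2_small) / (B_big - B_small)  # |G|^2
S_est = (g2_small - g2_big) / (1 / B_small - 1 / B_big)             # tr(Sigma)

B_simple = S_est / G2_est  # the gradient noise scale
print(f"estimated B_simple = {B_simple:.0f}  (true: {sigma2 * d / np.sum(G**2):.0f})")
```

In real training these squared norms come from actual minibatch gradients (for instance, comparing per-worker and all-reduced gradients in a data-parallel setup); a noise scale well above the current batch size suggests that growing the batch would still yield nearly linear speedups.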
This post is licensed under CC BY 4.0 by the author.