Exploring the gradient noise scale

A demonstration of critical batch size, and a tribute to open-access machine learning.

This visualization draws deep inspiration from Gabriel Goh’s “Why Momentum Really Works”.

It is my tribute to Distill, a journal that profoundly shaped my perspective as a young aspiring researcher in the late 2010s, but which now rests on indefinite hiatus.

[Interactive figure: optimization dynamics under adjustable hyperparameters — learning rate (range 0.001–1.0, shown at 0.0050) and batch size (shown at B = 2048).]

To explore the nuances of the relationship between critical batch size and learning rate, I recommend the landmark paper by McCandlish et al., “An Empirical Model of Large-Batch Training”.
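To make the idea concrete, here is a minimal sketch of estimating the *simple* gradient noise scale, B_simple = tr(Σ) / |G|², using the two-batch-size estimators described in McCandlish et al. The setup is entirely synthetic (an assumed true gradient `G` with i.i.d. Gaussian per-example noise of scale `sigma`); all variable names and the toy gradient model are illustrative, not from any real training run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup (assumption): per-example gradients are the true
# gradient G plus i.i.d. noise with covariance sigma^2 * I, so the
# true noise scale is tr(Sigma) / |G|^2 = dim * sigma^2 here.
dim, sigma = 100, 5.0
G = rng.normal(size=dim)
G /= np.linalg.norm(G)  # normalize so |G|^2 = 1

def batch_grad(batch_size):
    """Mean gradient over one batch of noisy per-example gradients."""
    noise = rng.normal(scale=sigma, size=(batch_size, dim))
    return (G + noise).mean(axis=0)

# E[|g_B|^2] = |G|^2 + tr(Sigma) / B, so measuring the expected squared
# gradient norm at two batch sizes lets us solve for both unknowns.
B_small, B_big = 32, 1024
g_small = np.mean([np.linalg.norm(batch_grad(B_small)) ** 2 for _ in range(200)])
g_big = np.mean([np.linalg.norm(batch_grad(B_big)) ** 2 for _ in range(200)])

G2_est = (B_big * g_big - B_small * g_small) / (B_big - B_small)
tr_sigma_est = (g_small - g_big) / (1 / B_small - 1 / B_big)
B_simple = tr_sigma_est / G2_est
print(f"estimated noise scale: {B_simple:.0f} (true: {dim * sigma**2:.0f})")
```

The estimate lands close to the true value of 2500 in this toy model; in real training the same two-batch-size trick is applied to measured gradient norms, since per-example gradient covariances are never observed directly.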

This post is licensed under CC BY 4.0 by the author.