Exploring the gradient noise scale
A demonstration of critical batch size, and a tribute to open-access machine learning.
This visualization takes deep inspiration from Gabriel Goh’s “Why Momentum Really Works”.
It is my tribute to Distill, a journal that profoundly shaped my perspective as a young aspiring researcher in the late 2010s, but which is now on indefinite hiatus.
[Interactive demo: sliders control the learning rate (0.001–1.0, shown at 0.0050) and the batch size (shown at B = 2048).]
To explore the nuances of the relationship between critical batch size and learning rate, I recommend reading the landmark paper by McCandlish et al., “An Empirical Model of Large-Batch Training”.
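For readers curious how the noise scale is actually measured, that paper defines the simple noise scale as B_simple = tr(Σ) / |G|², the trace of the per-example gradient covariance divided by the squared norm of the true gradient, and its Appendix A gives unbiased estimators built from gradient norms at two batch sizes. Below is a minimal sketch of that estimator; the toy isotropic-noise gradient model and all parameter values are assumptions made purely for the demonstration, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumption): each per-example gradient is the true gradient G
# plus isotropic Gaussian noise, so tr(Sigma) = sigma2 * d by construction.
d, sigma2 = 1_000, 4.0
G = rng.normal(size=d) / np.sqrt(d)  # true gradient, |G|^2 on the order of 1

def mean_sq_norm(B, trials=200):
    """Average squared norm of the batch-mean gradient at batch size B."""
    total = 0.0
    for _ in range(trials):
        # The mean of B i.i.d. N(0, sigma2) noise vectors is N(0, sigma2 / B).
        eps_bar = rng.normal(scale=np.sqrt(sigma2 / B), size=d)
        total += np.sum((G + eps_bar) ** 2)
    return total / trials

B_small, B_big = 32, 1024
g2_small, g2_big = mean_sq_norm(B_small), mean_sq_norm(B_big)

# Unbiased estimators from McCandlish et al., Appendix A:
G2_est = (B_big * g2_big - B_small * g2_small) / (B_big - B_small)  # |G|^2
S_est = (g2_small - g2_big) / (1 / B_small - 1 / B_big)             # tr(Sigma)

B_simple = S_est / G2_est  # the gradient noise scale
print(f"estimated B_simple = {B_simple:.0f}  (true: {sigma2 * d / np.sum(G**2):.0f})")
```

In real training these squared norms come from actual minibatch gradients (for instance, comparing per-worker and all-reduced gradients in a data-parallel setup); a noise scale well above the current batch size suggests that growing the batch would still yield nearly linear speedups.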
This post is licensed under CC BY 4.0 by the author.