A Detailed History of Optimizers (And How The New ‘Adam-mini’ Optimizer Works)

A deep dive into how Optimizers work, the history of their development, and how the novel 'Adam-mini' optimizer enhances LLM training like never before

Dr. Ashish Bamania
Jul 21, 2024
Image generated with DALL-E 3

An Optimizer forms the basis for training most modern neural networks.

First published in 2014, the Adam Optimizer, along with variants such as AdamW, has become the dominant, go-to optimizer for training LLMs in the industry today.

But there’s an issue with Adam that has been largely overlooked due to its superior performance.

That issue is memory inefficiency.

For every model parameter, Adam keeps two additional state values (the first- and second-moment estimates, usually called m and v), which roughly triples the memory footprint compared to storing the weights alone.
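To see where that memory goes, here is a minimal sketch of a single Adam update step in NumPy (the standard formulation by Kingma and Ba, not the article's own code): alongside the parameters, Adam carries two state tensors, m and v, each the same shape as the parameters themselves.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are per-parameter state tensors,
    each with the same shape (and memory cost) as theta itself."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: every parameter tensor drags along two state tensors of equal size.
theta = np.zeros(10)
m, v = np.zeros_like(theta), np.zeros_like(theta)
theta, m, v = adam_step(theta, grad=np.ones(10), m=m, v=v, t=1)
```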

To train an LLM with 7 billion parameters, Adam requires around 86 GB of memory.

For models like Google PaLM, which has 540 billion parameters, more than 50 GPUs are needed just to hold Adam's optimizer states.
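As a rough, back-of-the-envelope check on those figures (using my own assumptions rather than the paper's exact accounting: fp32 values at 4 bytes each and 80 GB of memory per GPU), the numbers land in the same ballpark:

```python
BYTES_FP32 = 4        # assumption: fp32 parameters and optimizer states
GPU_MEMORY_GB = 80    # assumption: an 80 GB accelerator

def gigabytes(num_values: float) -> float:
    return num_values * BYTES_FP32 / 1e9

# 7B-parameter model: weights plus Adam's two state tensors (m and v)
print(gigabytes(7e9 * 3))             # ~84 GB, close to the quoted ~86 GB

# 540B-parameter model (PaLM scale): Adam's states alone
states = gigabytes(540e9 * 2)         # ~4,320 GB for m and v
print(states / GPU_MEMORY_GB)         # ~54 GPUs just to hold the states
```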

But maybe not anymore. Here’s some exciting news!

A team of ML researchers has developed a better version of Adam called Adam-mini.

The Adam-mini optimizer is twice as memory-efficient as AdamW and achieves 49.6% higher throughput when training billion-parameter LLMs.

This is a story where we deep dive into how Optimizers work, how they were developed, what their limitations are, and how Adam-mini solves s…
