A Detailed History of Optimizers (And How The New ‘Adam-mini’ Optimizer Works)
A deep dive into how optimizers work, the history of their development, and how the novel 'Adam-mini' optimizer makes LLM training dramatically more memory-efficient
An optimizer forms the basis for training most modern neural networks.
Published in 2014 (and refined into variants like AdamW in 2017), the Adam optimizer has become the dominant, go-to optimizer for training LLMs in the industry today.
But there’s an issue with Adam that has been largely overlooked due to its superior performance.
That issue is memory inefficiency.
To train an LLM with 7 billion parameters, Adam requires around 86 GB of memory.
For a model like Google's PaLM, which has 540 billion parameters, more than 50 GPUs are needed just to store Adam's optimizer states.
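To see where numbers like these come from, here is a rough back-of-envelope sketch (my own arithmetic, not the paper's exact accounting). It counts only Adam's two per-parameter states, the first moment m and the second moment v, stored in fp32; the ~86 GB figure above presumably also includes other fp32 tensors (such as a master copy of the weights) kept alongside the optimizer during mixed-precision training.

```python
# Back-of-envelope estimate of Adam's optimizer-state memory.
# Assumptions (mine, not the paper's): m and v are each stored in fp32
# (4 bytes per parameter), and each GPU is an 80 GB A100-class card.

BYTES_PER_FP32 = 4
GPU_MEMORY_GB = 80  # assumed A100-80GB

def adam_state_memory_gb(num_params: float) -> float:
    """Memory (in GB) needed for Adam's m and v states alone."""
    return num_params * 2 * BYTES_PER_FP32 / 1e9

for name, n_params in [("7B LLM", 7e9), ("PaLM (540B)", 540e9)]:
    state_gb = adam_state_memory_gb(n_params)
    gpus = state_gb / GPU_MEMORY_GB
    print(f"{name}: ~{state_gb:,.0f} GB for m and v "
          f"(~{gpus:.0f} x 80 GB GPUs just for optimizer states)")
```

Under these assumptions, the 540B case works out to roughly 54 80-GB cards just for m and v, which is consistent with the "more than 50 GPUs" figure above.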
But maybe not anymore. Here’s some exciting news!
A team of ML researchers has developed a more memory-efficient version of Adam called Adam-mini.
The Adam-mini optimizer cuts the optimizer's memory footprint roughly in half and achieves 49.6% higher throughput than AdamW when training billion-parameter LLMs.
This is a story where we take a deep dive into how optimizers work, how they were developed, what their limitations are, and how Adam-mini solves some of these limitations.