A Detailed History of Optimizers (And How The New ‘Adam-mini’ Optimizer Works)
A deep dive into how Optimizers work, the history of their development, and how the novel 'Adam-mini' optimizer enhances LLM training like never before
An Optimizer forms the basis for training most modern neural networks.
First published in 2014, the Adam Optimizer, along with its variants, has become the dominant and go-to optimizer for training LLMs in the industry today.
But there’s an issue with Adam that has been largely overlooked due to its superior performance.
That issue is Memory inefficiency.
To train an LLM with 7 billion parameters, Adam requires around 86 GB of memory.
For models like Google PaLM, which consists of 540 billion parameters, more than 50 GPUs are needed just to contain Adam itself.
But maybe not anymore. Here’s some exciting news!
A team of ML researchers has developed a better version of Adam called Adam-mini.
The Adam-mini optimizer is twice as memory efficient and achieves 49.6% higher throughput than AdamW when used to train billion-parameter LLMs.
This is a story where we deep dive into how Optimizers work, how they were developed, what their limitations are, and how Adam-mini solves some of these to serve as a breakthrough for the future of deep learning.
But First, What Are Optimizers?
An Optimizer is an algorithm that adjusts an ML model's parameters (weights and biases) to minimize the Loss function.
This yields a more accurate model as training progresses.

We must start with the most basic optimization algorithm, called Gradient Descent, to understand how modern Optimizers work.
Let’s learn about it.
What Is Gradient Descent?
Gradient Descent is the foundational mathematical optimization algorithm that aims to minimize a loss function for an ML model iteratively.
During model training, it starts with an initial set of parameters and calculates the gradient of the loss function for each parameter.
(For those new to this term, the gradient is a vector of the first-order partial derivatives of the loss function with respect to the model parameters.
It indicates the direction and rate of change of the loss with respect to each parameter.)
Next, using these gradient calculations, the optimizer iteratively updates the model’s parameters in the opposite direction to the gradient to reduce the loss function.
This can be thought of as following the direction of the slope of the surface of a mountain (here, created by the loss function) downhill until we reach a valley.

The algorithm can be mathematically expressed as —

θ ← θ − α · ∇L(θ)

where θ is a model parameter, α is the learning rate that controls the step size during Gradient Descent, and ∇L(θ) is the gradient of the loss function with respect to the parameter θ.
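To make this concrete, here is a minimal sketch of one Gradient Descent loop in plain Python with NumPy. The toy quadratic loss, learning rate, and iteration count are illustrative choices of mine, not taken from any particular model.

```python
import numpy as np

# Toy loss: L(theta) = ||theta - target||^2, whose gradient is known in closed form
target = np.array([3.0, -2.0])

def loss(theta):
    return np.sum((theta - target) ** 2)

def grad(theta):
    # Gradient of the loss with respect to theta
    return 2 * (theta - target)

theta = np.zeros(2)   # initial parameters
alpha = 0.1           # learning rate

for _ in range(100):
    theta = theta - alpha * grad(theta)   # step against the gradient

print(theta, loss(theta))   # theta approaches [3, -2], loss approaches 0
```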
Variants of Gradient Descent
The Gradient Descent algorithm has three variants based on how the loss function’s gradient is calculated.
1. Batch Gradient Descent (BGD)
In this variant, the entire dataset is used to compute the gradient of the loss function with respect to the parameters to perform just one update of the model parameters.

θ ← θ − (α/m) · Σᵢ ∇L(i)(θ)

where the sum runs over all data points, m is the number of training data points, and ∇L(i)(θ) is the gradient of the loss function with respect to the parameter θ for the i-th data point.

This approach leads to stable convergence but can be very slow and memory-inefficient for large datasets.
2. Stochastic Gradient Descent (SGD)
In this variant, the gradient is computed using one training data point at a time for one update of the model parameters.

θ ← θ − α · ∇L(i)(θ)

where ∇L(i)(θ) is the gradient of the loss function with respect to the parameter θ for the randomly selected i-th data point.

This approach is fast but produces noisy parameter updates.
3. Mini-Batch Gradient Descent (MBGD)
In this variant, a subset of the training dataset (mini-batch) is used to compute the gradient of the loss function with respect to the parameters to perform one update of the model parameters.

θ ← θ − (α/|B|) · Σᵢ∈B ∇L(i)(θ)

where B is a mini-batch of training data points, |B| is its size, and ∇L(i)(θ) is the gradient of the loss function with respect to the parameter θ for the i-th data point.

This approach offers a good trade-off between Batch Gradient Descent and Stochastic Gradient Descent, mitigating the deficiencies of both.
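The following sketch (plain Python/NumPy, with a made-up linear-regression dataset and a batch size chosen purely for illustration) shows how the three variants differ only in which data points feed a single parameter update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # toy dataset: 1,000 points, 5 features
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(w, X_batch, y_batch):
    # Gradient of the mean squared error on the given batch
    return 2 * X_batch.T @ (X_batch @ w - y_batch) / len(y_batch)

w = np.zeros(5)
alpha = 0.05

# Batch Gradient Descent: one update uses ALL 1,000 data points
w = w - alpha * grad(w, X, y)

# Stochastic Gradient Descent: one update uses a single randomly chosen data point
i = rng.integers(len(y))
w = w - alpha * grad(w, X[i:i+1], y[i:i+1])

# Mini-Batch Gradient Descent: one update uses a small random subset (here, 32 points)
idx = rng.choice(len(y), size=32, replace=False)
w = w - alpha * grad(w, X[idx], y[idx])
```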
However, Mini-Batch Gradient Descent still has several shortcomings that lead to sub-optimal convergence.
Here are a few major ones —
The same learning rate is applied to all parameter updates
The learning rate is fixed and does not adapt to the training dataset’s features
For non-convex loss functions, the algorithm can get trapped in local minima and on saddle points

Modern Optimizers To The Rescue
Most modern optimizers build on Gradient Descent and its variants while addressing their deficiencies.
(It must also be noted that many other optimization techniques, such as Particle Swarm Optimization and Bayesian Optimization, exist but do not rely on Gradient Descent.)
Let’s learn about some of the intermediate optimizers that led to Adam and its variant, AdamW.
Momentum
Momentum addresses a shortcoming of Stochastic Gradient Descent (SGD): its tendency to get stuck in local minima.
The idea behind it is to build inertia using a velocity term that accumulates past gradients and helps maintain a consistent direction during parameter updates.

v(t+1) = β · v(t) + η · ∇θL(θ(t))
θ(t+1) = θ(t) − v(t+1)

where θ(t) and θ(t+1) are the parameters at iterations t and t+1, respectively, v(t) is the velocity at iteration t, β is the momentum factor, η is the learning rate, and ∇θL(θ(t)) is the gradient of the loss function with respect to the parameter θ at iteration t.
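Here is a minimal Python/NumPy sketch of how such a Momentum update could look; the toy loss and the hyperparameter values are illustrative assumptions, not prescribed settings.

```python
import numpy as np

def momentum_step(theta, v, grad, eta=0.01, beta=0.9):
    # Velocity: exponentially decaying accumulation of past gradients
    v = beta * v + eta * grad
    # Move along the accumulated (smoothed) descent direction
    return theta - v, v

# Toy usage on L(theta) = theta^2, whose gradient is 2 * theta
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, v = momentum_step(theta, v, 2 * theta)
print(theta)   # approaches 0
```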
AdaGrad (Adaptive Gradient)
AdaGrad fixes another shortcoming of Stochastic Gradient Descent (SGD): it applies a single fixed learning rate to all parameters of the ML model.
Instead, AdaGrad adapts the learning rate per parameter, making larger updates to infrequently updated parameters and smaller updates to frequently updated ones.
This is done by accumulating the squared gradients for each parameter and using this value to scale the learning rate.

G(t) = G(t−1) + (∇θL(θ(t)))²
θ(t+1) = θ(t) − (η / (√(G(t)) + ϵ)) · ∇θL(θ(t))

where θ(t) and θ(t+1) are the parameters at iterations t and t+1, respectively, G(t) is the accumulated sum of the squares of the past gradients, η is the learning rate, ϵ is a small constant that prevents division by zero, and ∇θL(θ(t)) is the gradient of the loss function with respect to the parameter θ at iteration t.
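A bare-bones Python/NumPy sketch of an AdaGrad-style update might look like this; the toy loss, learning rate, and iteration count are again illustrative.

```python
import numpy as np

def adagrad_step(theta, G, grad, eta=0.1, eps=1e-8):
    # Accumulate the element-wise squared gradients seen so far
    G = G + grad ** 2
    # Parameters with a large gradient history get proportionally smaller steps
    theta = theta - eta * grad / (np.sqrt(G) + eps)
    return theta, G

# Toy usage on L(theta) = theta^2, whose gradient is 2 * theta
theta, G = np.array([5.0]), np.zeros(1)
for _ in range(500):
    theta, G = adagrad_step(theta, G, 2 * theta)
print(theta)   # moves toward 0, but progress slows as G keeps growing
```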
RMSProp (Root Mean Square Propagation)
RMSProp, like AdaGrad, adapts the learning rate for each parameter based on past gradients.
However, because AdaGrad accumulates all past squared gradients, its effective learning rate shrinks continuously.
RMSProp solves this by using an exponentially decaying average of past squared gradients, which keeps the effective learning rate from decaying to zero.

E[g²](t) = β · E[g²](t−1) + (1 − β) · (∇θL(θ(t)))²
θ(t+1) = θ(t) − (η / (√(E[g²](t)) + ϵ)) · ∇θL(θ(t))

where θ(t) and θ(t+1) are the parameters at iterations t and t+1, respectively, η is the learning rate, β is the decay rate for the moving average of squared gradients, E[g²](t) is the exponentially decaying average of past squared gradients up to iteration t, ϵ is a small constant that prevents division by zero, and ∇θL(θ(t)) is the gradient of the loss function with respect to the parameter θ at iteration t.
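In code, an RMSProp-style update could be sketched as follows; the default hyperparameter values shown are common choices, not mandated ones. The only change from the AdaGrad sketch is that the squared-gradient history decays exponentially instead of growing forever.

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad, eta=0.01, beta=0.9, eps=1e-8):
    # Exponentially decaying average of squared gradients (recent gradients dominate)
    Eg2 = beta * Eg2 + (1 - beta) * grad ** 2
    # Per-parameter step scaled by the root of that running average
    theta = theta - eta * grad / (np.sqrt(Eg2) + eps)
    return theta, Eg2
```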
But could a better optimizer be created?
Here Comes ‘Adam’ (Adaptive Moment Estimation)
Adam, first published in 2014, has become the backbone of training most neural networks today.
It combines the qualities of Momentum and RMSProp by tracking the gradients' first and second-moment estimates.
1. First Moment Estimate (Mean of the gradients)
Similar to Momentum, it keeps track of an exponentially decaying average of past gradients.
2. Second Moment Estimate (Uncentered Variance of the gradients)
Similar to RMSProp, it maintains an exponentially decaying average of previous squared gradients.
These terms are bias-corrected, leading to the following update rule for Adam.

m(t) = β₁ · m(t−1) + (1 − β₁) · ∇θL(θ(t))
v(t) = β₂ · v(t−1) + (1 − β₂) · (∇θL(θ(t)))²
m̂(t) = m(t) / (1 − β₁^t),  v̂(t) = v(t) / (1 − β₂^t)
θ(t+1) = θ(t) − η · m̂(t) / (√(v̂(t)) + ϵ)

where θ(t) and θ(t+1) are the parameters at iterations t and t+1, respectively, η is the learning rate, β₁ and β₂ are the decay rates of the two moment estimates, m̂(t) and v̂(t) are the bias-corrected first and second moment estimates, respectively, and ϵ is a small constant that prevents division by zero.
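A compact Python/NumPy sketch of a single Adam-style step could look like this; the defaults shown are the commonly used hyperparameter values, and this is a simplified illustration rather than a production implementation.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: decaying average of gradients (the Momentum-like part)
    m = beta1 * m + (1 - beta1) * grad
    # Second moment: decaying average of squared gradients (the RMSProp-like part)
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero-initialised moments (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```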
Taking ‘Adam’ A Step Ahead With ‘AdamW’
Adam was further improved when an approach called AdamW was proposed in 2019.
AdamW decouples the weight decay term (commonly implemented as L2 regularization) from the gradient-based moment estimates.
Instead of being folded into the gradient, the weight decay term (η·λ·θ(t)) is added directly to Adam's parameter update rule:

θ(t+1) = θ(t) − η · ( m̂(t) / (√(v̂(t)) + ϵ) + λ · θ(t) )

where λ is the weight decay coefficient and the remaining symbols are as defined for Adam above.
This helps AdamW attain better generalization performance than the standard Adam optimizer.
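A sketch of how the decoupled weight decay changes the update relative to the Adam sketch above might look like this; the weight_decay value is an arbitrary illustrative choice.

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Moment updates are identical to Adam; weight decay is NOT mixed into the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink the parameters directly in the update step
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v
```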
AdamW is widely adopted in the industry today, and Meta’s Llama family of LLMs has notably been trained using this optimizer.
But Wait, There's More. It's Time For ‘Adam-mini’!
AdamW isn’t perfect.
It is highly memory intensive.
It must store the first-order and second-order moment estimates for every parameter, which together require at least twice as much memory as the model itself.
To make this work, CPU offloading (moving optimizer states and parts of the workload from the GPU to CPU memory) and sharding (splitting the model and optimizer states across multiple GPUs) have been used.
Unfortunately, these increase the latency and, in turn, slow down model training.
Hence, a new approach to modify AdamW was proposed.
In its current implementation, AdamW assigns an individual learning rate to each parameter based on that parameter's gradient moment estimates.
This means that if we have a billion-parameter model, AdamW will allocate a billion different learning rates.

The researchers questioned this notion, asking —
“Is it really necessary to use an individual learning rate for each parameter?”
The answer to this was found to be — No.
They noticed that the Hessian matrix of a Transformer has a near-block-diagonal structure, in which the values within the dense blocks along the diagonal are significantly larger than the values outside those blocks.
(For those new to this term, a Hessian matrix is a matrix consisting of the second-order partial derivatives of the loss function with respect to the model parameters. It captures the loss function’s curvature and helps optimize model training.)
These dense sub-blocks are of different sizes and consist of groups of closely related parameters.
Specifically, the different components of the Transformer architecture — namely, the Query, Key, Value, and MLP layers — form these smallest distinct sub-blocks.

This insight led the researchers to develop Adam-mini.
In this algorithm, the model parameters are first partitioned into blocks based on the Transformer’s Hessian structure.
The Query and Key parameters are partitioned by heads, and the default PyTorch partition is used for all other parameters.
Next, a single learning rate is chosen for all of the parameters in each block, calculated by averaging the second moment estimates within that block.

For example, consider a model with five parameters: AdamW assigns each parameter its own learning rate based on its individual second-moment estimate.

Adam-mini, by contrast, assigns a single learning rate to all parameters within each block, obtained by averaging the second-moment estimates in that block. This greatly reduces the total number of learning rates used.

In other words, a block's second-moment estimate is simply the average of the element-wise second-moment estimates of the parameters it contains.
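To illustrate the idea, here is a heavily simplified, NumPy-only sketch of an Adam-mini-style update based on my reading of the description above. The block names, tensor shapes, and hyperparameters are hypothetical, and this is not the authors' implementation (see their GitHub repository, linked below, for the real one).

```python
import numpy as np

def adam_mini_step(params, grads, m, v_block, t,
                   eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # params, grads, m map block names to arrays; v_block holds ONE scalar per block
    for name in params:
        g = grads[name]
        # Per-parameter first moment, exactly as in Adam/AdamW
        m[name] = beta1 * m[name] + (1 - beta1) * g
        # A single scalar second moment per block: the mean of the squared gradients
        v_block[name] = beta2 * v_block[name] + (1 - beta2) * float(np.mean(g ** 2))
        m_hat = m[name] / (1 - beta1 ** t)
        v_hat = v_block[name] / (1 - beta2 ** t)
        # Every parameter in this block shares one effective learning rate
        params[name] = params[name] - eta * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v_block

# Hypothetical blocks: query/key weights split per attention head, plus an MLP layer
params = {"head0.query": np.zeros((4, 4)), "head0.key": np.zeros((4, 4)),
          "mlp.fc1": np.zeros((4, 16))}
grads = {k: np.ones_like(p) for k, p in params.items()}
m = {k: np.zeros_like(p) for k, p in params.items()}
v_block = {k: 0.0 for k in params}
params, m, v_block = adam_mini_step(params, grads, m, v_block, t=1)
```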

The Performance of ‘Adam-mini’
The overall performance of Adam-mini is quite exceptional, as demonstrated by various indicators.
Memory Efficiency
Adam-mini cuts the number of second-moment estimates (denoted by v) that need to be stored by more than 90% across different LLMs.
This saves 45% to 50% of the memory used by AdamW.

Throughput
Thanks to its memory efficiency, Adam-mini greatly reduces the communication between CPUs and GPUs.
Also, Adam-mini’s update rules do not introduce any extra computations. All they do is average the second-moment estimates, which is a computationally cheap operation.
Together, these let Adam-mini achieve roughly 50% higher throughput than AdamW, cutting the wall-clock time for pre-training Llama2-7B by about 33%.


Performance On LLM Pre-Training
When pre-training open-source LLMs from the Llama and GPT-2 series from scratch, Adam-mini matches the performance of AdamW while utilizing less memory.



It is also noted that Adam-mini is not sensitive to hyperparameters and maintains a stable validation loss during training.

Performance on LLM Supervised Fine-Tuning
Supervised fine-tuning (SFT) of the pre-trained Llama-2-7B model, both with and without LoRA (Low-Rank Adaptation), shows that Adam-mini achieves lower evaluation perplexity than AdamW while using less memory.

Similarly, higher evaluation rewards are observed when Reinforcement Learning from Human Feedback (RLHF) is applied to the same LLM.

Next, when the fine-tuned models’ chat ability is evaluated on the MT-Bench benchmark, Adam-mini outperforms AdamW on each downstream task.

Performance on Non-LLM Tasks
The researchers also devised a version of Adam-mini for non-LLM neural networks.
The algorithm first groups parameters into blocks, by layers or other logical divisions within the model.
The second-moment estimates within each block of parameters are then averaged, and based on this, each block of parameters is assigned a single learning rate.
It is noted that Adam-mini performs on par with or better than AdamW on all the popular non-LLM tasks tested.

These results are groundbreaking!
They will enable future researchers to train LLMs (and other deep neural network architectures) with fewer GPUs, reducing costs and energy use, further accelerating ML research, and democratising AI development.
What are your thoughts on the Adam-mini optimizer? Have you tried using it in your projects yet? Let me know in the comments below.
Further Reading
Research paper titled ‘Adam-mini: Use Fewer Learning Rates To Gain More’ on ArXiv
GitHub repository containing the implementation of the Adam-mini optimizer
Research paper titled ‘Adam: A Method for Stochastic Optimization’ on ArXiv
Research paper titled ‘Decoupled Weight Decay Regularization’ on ArXiv
Research paper titled ‘Why Transformers Need Adam: A Hessian Perspective’ on ArXiv
Research paper titled ‘An overview of gradient descent optimization algorithms’ on ArXiv