You Don’t Need Normalization In Transformers Anymore

A deep dive into the internals of Layer Normalization, and how a simple function called Dynamic Tanh (DyT) can replace them entirely in the Transformer architecture without any loss in performance.

Dr. Ashish Bamania
Aug 15, 2025

Image generated with Google ImageFX

Normalization layers are everywhere.

They are considered essential and irreplaceable, and virtually all modern neural network architectures, including Transformers, use them by default.

A group of researchers from Meta has just published new research that challenges this norm.

They introduce a simple element-wise operation called Dynamic Tanh (DyT), which can easily and entirely replace normalization layers in Transformers.
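Concretely, the paper defines DyT as DyT(x) = γ · tanh(αx) + β, with a learnable scalar α inside the tanh and the usual per-channel scale γ and shift β. Here is a minimal PyTorch sketch of that operation (the module name, default α, and parameter shapes here are illustrative, not the paper's reference code):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: an element-wise stand-in for a normalization layer (sketch)."""
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * init_alpha)  # learnable scalar inside tanh
        self.gamma = nn.Parameter(torch.ones(dim))              # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))               # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No mean or variance statistics are computed anywhere
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```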

Experiments show that such a replacement results in an architecture that matches (and sometimes exceeds) the performance of conventional Transformers with normalization, mostly without any hyperparameter tuning.

Similar results are observed in all kinds of experiments, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models.

In this deep dive, we explore how normalization works internally and how its function can be replicated, and entirely replaced, by the simple Dynamic Tanh (DyT) operation in various neural network architectures.

Let’s begin!


My latest book, called “LLMs In 100 Images”, is now out!

It is a collection of 100 easy-to-follow visuals that describe the most important concepts you need to master LLMs today.

Grab your copy today at a special early bird discount using this link.


To Start With, What Really Is Normalization?

Neural networks are notoriously hard to train.

During training, the distribution of inputs to each layer of a neural network can shift as the parameters of earlier layers change. This is called internal covariate shift.

To fix this unwanted distribution shift, a technique called Normalization is used, which adjusts and scales the outputs (activations) of neurons in the neural network.

Take an input tensor x with shape (B,T,C) where:

  • B is the batch size (the number of samples)

  • T is the number of tokens in a sample

  • C is the embedding dimension

For example, a batch of 32 sentences (B), each tokenized into 128 tokens (T), and each token represented as a 768-dimensional vector (C), has a shape of (32, 128, 768).

Normalization is applied to this tensor using the formula below (a short code sketch follows the parameter list):

y = γ · (x − μ) / √(σ² + ϵ) + β

where:

  • γ and β are learnable parameters of shape (C, ) for scaling and shifting the output (affine transformation)

  • ϵ is a small constant added to the denominator to prevent division by zero (without it, a near-zero variance could blow up the normalized values and their gradients)

  • μ and σ² are the mean and variance of the input tensor. These are computed differently based on the method being used, as discussed next.
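In code, this generic formula (with the statistics μ and σ² supplied by whichever normalization method is in use) is only a couple of lines. A minimal sketch with illustrative names:

```python
import torch

def normalize(x, mean, var, gamma, beta, eps: float = 1e-5):
    """y = gamma * (x - mean) / sqrt(var + eps) + beta (generic form)."""
    x_hat = (x - mean) / torch.sqrt(var + eps)  # standardize with the given statistics
    return gamma * x_hat + beta                 # learnable affine scale and shift
```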

In 2015, researchers at Google published the research paper that introduced Batch Normalization.

Batch Normalization (BatchNorm or simply BN), which was primarily intended for CNNs, computes the mean μ and variance σ² for each channel C (indexed by k) across both the batch (B) and token (T) dimensions (indexed by i and j, respectively):

μ_k = (1 / (B·T)) · Σ_{i=1..B} Σ_{j=1..T} x_{i,j,k}

σ²_k = (1 / (B·T)) · Σ_{i=1..B} Σ_{j=1..T} (x_{i,j,k} − μ_k)²

(Note that we use the term “channel” to refer to C when we talk about CNNs/vision models. It is also called the “feature dimension” in a general machine-learning context and the “embedding dimension” in the context of language or sequence models.)
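As a rough sketch (ignoring BatchNorm's running statistics and affine parameters), these per-channel statistics on a (B, T, C) tensor can be computed like this:

```python
import torch

B, T, C = 32, 128, 768
x = torch.randn(B, T, C)

# BatchNorm: reduce over the batch (dim 0) and token (dim 1) dimensions,
# leaving one mean/variance per channel -> shape (C,)
mu = x.mean(dim=(0, 1))
var = x.var(dim=(0, 1), unbiased=False)
print(mu.shape, var.shape)  # torch.Size([768]) torch.Size([768])
```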

BatchNorm soon began to be applied in a variety of vision models and became widely successful (the paper has been awarded the ICML Test Of Time Award 2025).

Following it came many other types of normalization layers, namely:

  • Instance Normalization (in 2016)

  • Group Normalization (in 2018)

  • Layer Normalization (in 2016) and RMS Layer Normalization (in 2019)

As we previously discussed, the difference between these methods lies in how the mean and variance are calculated over the input.

While BatchNorm calculates them across the batch (B) and token (T) dimensions for each channel (C), the other methods calculate them as follows (see the code sketch after this list):

  • InstanceNorm: across tokens (T), for each sample (B) and each channel (C)

  • GroupNorm: across groups of channels (C) and tokens (T), for each sample (B)

  • LayerNorm: across all channels (C), for each sample (B) and each token (T)
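A minimal sketch of how these reduction axes differ on the same (B, T, C) tensor (the number of groups G for GroupNorm is illustrative); the variances follow the same pattern:

```python
import torch

B, T, C, G = 32, 128, 768, 12       # G groups of C // G channels each (illustrative)
x = torch.randn(B, T, C)

# InstanceNorm: reduce over tokens only -> one statistic per (sample, channel)
inst_mu = x.mean(dim=1)                               # shape (B, C)

# GroupNorm: reduce over tokens and the channels within each group
grp_mu = x.view(B, T, G, C // G).mean(dim=(1, 3))     # shape (B, G)

# LayerNorm: reduce over all channels -> one statistic per (sample, token)
ln_mu = x.mean(dim=-1)                                # shape (B, T)
```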

In the context of CNNs, LayerNorm computes these statistics across both channels (C) and spatial dimensions (H x W), for each sample (represented by B or N in the image), as shown below.

Comparison between different normalization methods. C represents the number of Channels/ embedding dimension, N is the Batch size (called B in text), and H x W represents the number of tokens (T = H × W) (Source: ArXiv research paper titled ‘Group Normalization’)

While GroupNorm and InstanceNorm are mostly used to improve object detection and image stylization respectively, LayerNorm (and its variant RMSNorm) has become the de facto normalization layer in Transformer-based architectures.

In LayerNorm, the mean is computed across all channels C (indexed by k), independently for each sample i (out of B) and each token j (out of T):

μ_{i,j} = (1 / C) · Σ_{k=1..C} x_{i,j,k}

And the variance is computed the same way, from the squared deviations around that mean:

σ²_{i,j} = (1 / C) · Σ_{k=1..C} (x_{i,j,k} − μ_{i,j})²

The general normalization formula that we previously discussed,

y = γ · (x − μ) / √(σ² + ϵ) + β,

then becomes the following for LayerNorm (LN), applied per sample i and per token j:

LN(x_{i,j}) = γ · (x_{i,j} − μ_{i,j}) / √(σ²_{i,j} + ϵ) + β

Layer normalization visualised (Image from author’s book “LLMs In 100 Images”)
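As a quick sanity check, a hand-rolled version of these statistics matches PyTorch's built-in nn.LayerNorm (whose γ and β are initialized to 1 and 0). A minimal sketch:

```python
import torch
import torch.nn as nn

B, T, C = 32, 128, 768
x = torch.randn(B, T, C)

ln = nn.LayerNorm(C)                                  # gamma = 1, beta = 0 at initialization

mu = x.mean(dim=-1, keepdim=True)                     # one mean per (sample, token)
var = x.var(dim=-1, unbiased=False, keepdim=True)     # one variance per (sample, token)
manual = (x - mu) / torch.sqrt(var + ln.eps)          # affine is a no-op at init

print(torch.allclose(manual, ln(x), atol=1e-5))       # True
```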

Building on this, a 2019 research paper introduced Root Mean Square Layer Normalization (RMSNorm), a computationally simpler and thus more efficient alternative to LayerNorm.

It removes the step of subtracting the mean from the input tensor (mean centering) and instead normalizes the input by its RMS, or root mean square, value:

RMS(x_{i,j}) = √( (1 / C) · Σ_{k=1..C} x_{i,j,k}² )

This makes the RMSNorm formula (note the omission of the affine shift term β):

RMSNorm(x_{i,j}) = γ · x_{i,j} / RMS(x_{i,j})
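A minimal RMSNorm module following this formula (a sketch in the spirit of, but not copied from, the open-source LLaMA-style implementations):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # scale only, no shift term

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the channel dimension; no mean subtraction
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * (x / rms)
```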

RMSNorm is used today in the LLaMA, Mistral, Qwen, DeepSeek, and OpenELM series of LLMs. GPT-2, on the other hand, uses LayerNorm.

Normalization layers are consistently seen to ease optimization, helping neural networks converge faster and generalize better.

Additionally, there has been a lot of work on replacing attention or convolution layers in deep neural network architectures (MLP-Mixer and Mamba, for example), but replacing normalization layers is rarely explored.

A natural question arises at this point: What do Normalization layers do internally that leads to such impressive results?

Let’s talk about this next.


What Do Normalization Layers Do Internally?

To answer this question, the researchers take three different Transformer models:

  • a Vision Transformer model (ViT-B) trained on the ImageNet-1K dataset

  • a wav2vec 2.0 Large Transformer model trained on the Librispeech audio dataset

  • a Diffusion Transformer (DiT-XL) trained on the ImageNet-1K dataset

All of the above models have LayerNorm applied in every Transformer block and before the final linear projection.

The following visualisation shows the Transformer architecture as a refresher.

Transformer visualised (Image from author’s book “LLMs In 100 Images”)

A mini-batch of samples is used during the forward pass through these models.

Next, the input tensors to the normalization layers at varying depths in the network, along with their output tensors (taken before the scaling and shifting operations), are recorded and plotted to see how each normalization layer transforms them.
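One way to reproduce this kind of measurement is with PyTorch forward hooks on the LayerNorm modules. A rough, self-contained sketch on a toy model (the actual experiments use the pretrained ViT, wav2vec 2.0, and DiT checkpoints):

```python
import torch
import torch.nn as nn

# Toy stand-in for a Transformer: any model containing LayerNorm modules works
model = nn.Sequential(
    nn.Linear(768, 768), nn.LayerNorm(768),
    nn.GELU(),
    nn.Linear(768, 768), nn.LayerNorm(768),
)

records = {}  # module name -> (flattened inputs, flattened pre-affine outputs)

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach()
        # Recompute the output *before* the affine scale/shift (gamma, beta)
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, unbiased=False, keepdim=True)
        x_hat = (x - mu) / torch.sqrt(var + module.eps)
        records[name] = (x.flatten(), x_hat.flatten())
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.LayerNorm):
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.randn(32, 128, 768))  # forward pass on a mini-batch

# Each entry in `records` can now be scatter-plotted:
# input values on the x-axis, pre-affine outputs on the y-axis.
```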

The following is how these plots look.

Output (y-axis) vs. input (x-axis) of four different LayerNorm layers at different depths in the ViT, wav2vec 2.0, and DiT models. Columns further to the left correspond to earlier layers.
