Deep-Diving & Decoding The Secrets That Make DeepSeek So Good
Part 1: Multi-head Latent Attention (MLA)
DeepSeek is all over the news.
Its models have achieved state-of-the-art performance across multiple benchmarks (educational, factuality, math and reasoning, coding) and are competing head-to-head with OpenAI’s o1.
Since its release, nearly $1 trillion of value has been lost in U.S. technology stocks in the S&P 500.
However, many popular beliefs about it are not correct.
A $6 million training cost for DeepSeek-R1 is being peddled all across the internet, but this is far from the truth (the official figures remain undisclosed).
This figure is actually for the official training of DeepSeek-V3 (the predecessor model for R1) and even then excludes the costs associated with prior research and ablation experiments on architectures, algorithms, or data associated with V3.

Even though the widely quoted cost figure is misattributed, DeepSeek-V3 was still trained with significantly fewer resources than the models of the other prominent players in the LLM market.
But how has this been made possible?
Here’s a story about a groundbreaking architectural change called Multi-Head Latent Attention, one of the reasons DeepSeek’s models perform so well despite their comparatively modest training resources.
Let’s begin!
We Start Our Journey With ‘Attention’
Attention is a mechanism that allows a model to focus on different parts of the input sequence when making predictions.
The mechanism weighs the importance of each token in the sequence and captures relationships between different tokens of the sequence regardless of their distance from each other.
This helps the model decide which tokens from the input sequence are most relevant to the token being processed.
Attention is not new.
It was popularly introduced for neural machine translation tasks using Recurrent Neural Networks (RNNs).
This version, called the Bahdanau Attention mechanism, uses a bi-directional RNN as an Encoder (bottom blocks in the image) to process input sequences x(1) to x(T) and generate hidden states h(1) to h(T).
The Attention mechanism computes an attention score for each encoder hidden state, indicating how relevant that state is to the current decoding step.
These scores are transformed into attention weights a(t,1) to a(t,T) (where T is the total number of input tokens) using the softmax function.
The encoder’s hidden states are weighted by these attention weights and summed to produce a context vector c(t).
At each time step t, the decoder (top blocks in the image) then generates the next output y(t) based on its current hidden state s(t), which is in turn computed from the previous hidden state s(t-1), the previous output y(t-1) and the context vector c(t).
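To make the weighting-and-summing step concrete, here is a minimal NumPy sketch of Bahdanau-style additive scoring. All dimensions, weight matrices and vectors below are made up purely for illustration and are not taken from the original paper:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                     # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Made-up sizes: T input tokens, hidden size d
T, d = 5, 8
rng = np.random.default_rng(0)

h = rng.normal(size=(T, d))             # encoder hidden states h(1)..h(T)
s_prev = rng.normal(size=d)             # previous decoder hidden state s(t-1)

# Additive (Bahdanau) scoring: score_j = v^T tanh(W_s s(t-1) + W_h h(j))
W_s = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))
v = rng.normal(size=d)

scores = np.array([v @ np.tanh(W_s @ s_prev + W_h @ h_j) for h_j in h])
a = softmax(scores)                     # attention weights a(t,1)..a(t,T), summing to 1
c = a @ h                               # context vector c(t): weighted sum of encoder states

print(a.round(3), c.shape)              # weights over the 5 input tokens, (8,)-dim context vector
```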

A particular type, called Scaled Dot-Product Attention, was later introduced in the Transformer architecture, which is calculated using the following three values obtained from the input token embeddings:
Query (Q): a vector representing the current token that the model is processing.
Key (K): a vector representing each token in the sequence.
Value (V): a vector containing the information associated with each token.
These are used in the formula as follows:

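That formula is the standard Scaled Dot-Product Attention: softmax(Q x K^T / sqrt(d(k))) x V. Here is a minimal NumPy sketch of it, with toy dimensions chosen purely for illustration:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of each query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of value vectors

# Toy example: 4 tokens, d(k) = d(v) = 8 (made-up numbers)
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)   # (4, 8): one output vector per token
```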
Transformers use this in three ways:
1. Self-Attention
This is used in the Transformer’s Encoder.
Here, the Queries, Keys, and Values come from the same input sequence — the previous layer’s output in the Encoder.
2. Masked Self-Attention
This is used in the Transformer’s Decoder.
Here, the Queries, Keys, and Values come from the same sequence — the output sequence generated so far with the future tokens masked.
3. Cross-Attention or Encoder-Decoder Attention
This is used in the Transformer’s Decoder.
Here, the Query comes from the Decoder’s previous layer, while the keys and values come from the Encoder’s output.

Moving Towards Multi-Head Attention
Instead of calculating the Attention score just once, the Transformer runs multiple Attention mechanisms in parallel using multiple heads.
Each head focuses on different aspects of the input sequence (short vs. long-range dependencies, grammatical rules, etc.).
This helps the architecture capture better semantic relationships between different tokens.
Let’s learn how this works.
Let’s say that the Transformer architecture has an overall model dimension represented by d(model). This represents the dimensionality of the input/hidden layer representations X.
Instead of working with full-dimensional vectors, each head works with lower-dimensional projections of the Queries, Keys and Values.
These are obtained using learned projection matrices as follows:
The matrices W(Q)(i), W(K)(i), and W(V)(i) project X into lower-dimensional Query, Key, and Value vectors for each attention head.
d(model) is reduced into smaller dimensions d(k) for Queries and Keys, and d(v) for Values, as follows:
d(k) = d(v) = d(model) / h
where h is the number of heads.
Next, Scaled dot-product Attention is calculated for each head using its own set of projected Query, Key, and Value matrices.
The outputs of the individual heads are then concatenated and linearly transformed using a learned output matrix W(O), as shown below.

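To tie the pieces together, here is a minimal NumPy sketch of the full multi-head computation: per-head projections, scaled dot-product attention in each head, concatenation, and the final W(O) projection. The sizes (d(model) = 16, h = 4, 5 tokens) and the random matrices are made up for illustration:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention, same formula as before
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

# Made-up sizes for illustration
d_model, h, T = 16, 4, 5
d_k = d_v = d_model // h                    # d(k) = d(v) = d(model) / h

rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))           # input representations X

# One set of projection matrices W(Q)(i), W(K)(i), W(V)(i) per head
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_v))
W_O = rng.normal(size=(h * d_v, d_model))   # output projection W(O)

heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]
out = np.concatenate(heads, axis=-1) @ W_O  # concatenate heads, project back to d(model)

print(out.shape)   # (5, 16)
```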
Going back to the Transformer architecture, it is Multi-head Attention (MHA) that is actually used instead of the basic Attention mechanism (described earlier) for both Self-Attention & Cross-Attention.

But MHA Is Memory Expensive At Inference
Multi-head Attention (MHA) is a powerful mechanism used by most LLMs today to capture dependencies between tokens, but it causes an issue during inference (token prediction).
When the LLM generates a token, it must compute the attention scores with all the previous tokens.
Instead of recomputing all keys and values for previous tokens at every time step, they are stored in a Key-Value (KV) Cache.
(Queries are not cached since they are dynamically calculated for each new token, and only Keys and Values need to be reused for future tokens.)
This is a fantastic optimisation step to speed up inference, but as the sequence length increases, the number of stored key-value pairs grows linearly with it.
For a transformer with L layers, n(h) heads per layer, and a per-head dimension of d(h), 2 x n(h) x d(h) x L elements need to be cached for each token.
This is because, for each token, each head independently stores a separate key and value vector of size d(h), and there are n(h) heads per layer and L layers in total.
This cache can grow enormous over time, especially in long-context models, leading to heavy GPU memory usage during cache retrieval and slower inference.
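To get a feel for the numbers, here is a quick back-of-the-envelope calculation. The configuration below (32 layers, 32 heads of dimension 128, 16-bit cache entries, a 4096-token context and a batch of 8 sequences) is a hypothetical but typical 7B-class setup, not the published configuration of any particular model:

```python
# Per-token KV cache elements: 2 x n(h) x d(h) x L
L_layers, n_h, d_h = 32, 32, 128          # hypothetical 7B-class configuration
bytes_per_element = 2                     # fp16 / bf16

per_token_bytes = 2 * n_h * d_h * L_layers * bytes_per_element
print(per_token_bytes / 1024, "KB per token")               # 512.0 KB per token

context_len, batch_size = 4096, 8
total_gb = per_token_bytes * context_len * batch_size / 1024**3
print(round(total_gb, 1), "GB of KV cache for the batch")   # 16.0 GB
```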
Let’s learn how this memory bandwidth bottleneck is overcome in modern LLMs.
Levelling Up To Multi-Query Attention (MQA)
Unlike traditional Multi-head Attention (MHA), which caches separate key-value pairs per head, Multi-Query Attention (MQA) shares a single set of keys and values across all heads.
This reduces the KV cache size from 2 x n(h) x d(h) x L (as in MHA) to 2 x d(h) x L.
Since only one set of keys and values needs to be fetched, this reduces GPU memory usage, allowing large batch sizes to be processed at inference time.

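A minimal sketch of the idea, continuing with the toy NumPy setup from the earlier snippets: each head still gets its own Query projection, but there is only one Key projection and one Value projection, and their outputs are the only tensors that would go into the KV cache (the final W(O) projection is omitted for brevity):

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

d_model, h, T = 16, 4, 5
d_k = d_v = d_model // h
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))

W_Q = rng.normal(size=(h, d_model, d_k))   # one Query projection per head
W_K = rng.normal(size=(d_model, d_k))      # a single shared Key projection
W_V = rng.normal(size=(d_model, d_v))      # a single shared Value projection

K, V = X @ W_K, X @ W_V                    # only these would be stored in the KV cache
heads = [attention(X @ W_Q[i], K, V) for i in range(h)]

print(np.concatenate(heads, axis=-1).shape)   # (5, 16): same output shape as MHA
```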
LLMs like PaLM and Falcon use MQA instead of MHA.
But MQA isn’t perfect.
Since all heads share the same Keys and Values, the diversity of learned representations is reduced, making the LLM less expressive and worse at tracking long-range dependencies.
MQA has also been observed to cause instability during fine-tuning, particularly on long-input tasks.
What’s the fix?
Grouped-Query Attention (GQA) To The Rescue
Published by a team of Google researchers, Grouped-Query Attention (GQA) is a tradeoff between MHA and MQA.
Instead of one KV pair per head (like in MHA) or one KV pair for all heads (like in MQA), GQA groups multiple heads together that share a single KV pair.
Each group processes its own set of queries but shares the same keys and values.
This reduces the KV cache size from 2 x n(h) x d(h) x L (as in MHA) to 2 x G x d(h) x L, where G is the number of groups (with G smaller than n(h)).
This makes inference much faster while, at the same time, allowing the LLM to learn its representations effectively.

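The sketch below extends the same toy NumPy setup to G groups of heads; setting G = 1 recovers MQA, while G = n(h) recovers MHA. As before, every shape and matrix is made up purely for illustration:

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

d_model, h, G, T = 16, 4, 2, 5             # 4 query heads sharing 2 KV groups
d_k = d_v = d_model // h
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))

W_Q = rng.normal(size=(h, d_model, d_k))   # one Query projection per head
W_K = rng.normal(size=(G, d_model, d_k))   # one Key projection per group
W_V = rng.normal(size=(G, d_model, d_v))   # one Value projection per group

Ks = [X @ W_K[g] for g in range(G)]        # only G Key/Value sets are cached
Vs = [X @ W_V[g] for g in range(G)]

heads_per_group = h // G
heads = [attention(X @ W_Q[i], Ks[i // heads_per_group], Vs[i // heads_per_group])
         for i in range(h)]

print(np.concatenate(heads, axis=-1).shape)   # (5, 16)
```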
To summarise:
MHA's generation quality is the best, but its inference speed is the lowest.
MQA leads to the fastest inference speed but the lowest generation quality.
GQA balances and interpolates between the two.
