Memory Layers Are Supercharging LLMs Like Never Before
A deep dive into how Memory layers work and how they supercharge LLMs, so much so that next-generation AI architectures will miss out if they do not use them.
LLMs store massive amounts of knowledge within their parameters (primarily as the weights of the linear matrix transforms in their dense layers).
However, as the parameter size grows, so does the computational and energy cost.
Can these dense layers be replaced with simple and cheap key-value lookup mechanisms?
Much previous research has explored this idea, but never at the scale of today's AI architectures.
Meta researchers have now figured it out, developing Memory layers that can supercharge our current LLMs.
These layers replace the Feed-forward network (FFN) of one or more Transformer layers.
And the results are surprisingly good!

Memory layers improve the factual accuracy of LLMs by over 100%, along with improvements in coding performance and general knowledge comparable to conventional LLMs trained with 4X more compute!
These memory-layer-augmented LLMs also beat Mixture-of-Experts architectures trained with similar compute and parameter counts, especially on factual tasks.
In this deep dive, we explore how Memory layers work and how they supercharge LLMs, so much so that next-generation AI architectures will miss out if they do not use them.
What Are Memory Layers?
Memory layers work similarly to the Attention mechanism in a Transformer.
Given a Query (Q), Keys (K) and Values (V), they output a weighted sum of the Values, where the weights are determined by the similarity between the query and the keys using a softmax function.
However, two major differences separate memory layers from conventional Attention.
First — unlike the Attention mechanism (where keys and values are computed dynamically for each query), keys and values are trainable parameters in memory layers that are learned and stored persistently.
Second — the number of key-value pairs used in Memory layers is huge (in millions).
Only the top-k most similar keys and corresponding values are used to calculate the output, making the lookup and updates computationally efficient at this scale.
A Memory layer can be described using the following equations:

$$I = \mathrm{SelectTopkIndices}(Kq), \qquad s = \mathrm{Softmax}(K_I\, q), \qquad y = s\, V_I$$

The indices ($I$) of the top-k keys are first selected based on their similarity to the query ($q$).
The similarity scores ($K_I q$) of the selected keys are then normalized using a Softmax to obtain the weights ($s$).
Finally, the output ($y$) is calculated as the weighted sum of the top-k selected values ($V_I$).
Each token embedding passes through a Memory layer independently, similar to the Feed-forward layers in a conventional Transformer.
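To make this concrete, here is a minimal PyTorch sketch of a memory layer forward pass following the equations above. The class and parameter names (SimpleMemoryLayer, num_keys, top_k) are my own illustrative choices, and scoring the query against every key is a naive simplification rather than the optimized lookup used at real scale, which the next section turns to.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    """A minimal (non-optimized) memory layer: trainable keys/values + top-k lookup."""

    def __init__(self, dim: int, num_keys: int, top_k: int = 32):
        super().__init__()
        self.top_k = top_k
        # Unlike attention, keys and values are persistent, trainable parameters.
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_keys, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -- each token embedding is processed independently.
        scores = x @ self.keys.T                               # naive scoring against every key
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # I = SelectTopkIndices(Kq)
        weights = F.softmax(top_scores, dim=-1)                # s = Softmax(K_I q)
        top_values = self.values[top_idx]                      # (batch, seq_len, top_k, dim)
        return (weights.unsqueeze(-1) * top_values).sum(dim=-2)  # y = s V_I

# Hypothetical usage: drop in place of the FFN block of a Transformer layer.
# num_keys is kept small here to stay lightweight; real memory layers use millions of keys.
layer = SimpleMemoryLayer(dim=512, num_keys=65_536, top_k=32)
tokens = torch.randn(2, 16, 512)
out = layer(tokens)  # (2, 16, 512)
```

Note how, unlike attention, the keys and values here are parameters that persist across inputs and are learned during training.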
How Are Similar Keys Searched At Scale?
It is computationally expensive to find the most similar keys for a query.
A naive nearest-neighbour search will:
1. compute a similarity score (such as cosine similarity) between the query and every key (time complexity: O(N ⋅ n) for N keys of dimensionality n)
2. sort the keys by their similarity scores (time complexity: O(N log N) for N keys)
3. select the top-k keys with the highest similarity scores
4. calculate the final output using the top-k keys and their values

The memory cost of this approach is also O(N ⋅ n) for N keys of dimensionality n.
This is infeasible, given there are millions of keys.
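To see where the cost comes from, here is a NumPy sketch of the naive search. The sizes (N, n, top_k) and array names are illustrative assumptions rather than values from the paper, and the similarity here is a plain dot product.

```python
import numpy as np

# Illustrative sizes (assumptions); real memory layers use millions of keys,
# which is exactly what makes this naive approach infeasible.
N, n, top_k = 100_000, 512, 32

keys = np.random.randn(N, n).astype(np.float32)    # O(N * n) memory just to store the keys
values = np.random.randn(N, n).astype(np.float32)
query = np.random.randn(n).astype(np.float32)

# 1. Score the query against every key: O(N * n) time.
scores = keys @ query

# 2. Rank all keys by similarity: O(N log N) for a full sort.
order = np.argsort(-scores)

# 3. Select the top-k most similar keys.
top_idx = order[:top_k]

# 4. Softmax the top-k scores and take the weighted sum of the selected values.
weights = np.exp(scores[top_idx] - scores[top_idx].max())
weights /= weights.sum()
output = weights @ values[top_idx]                 # shape: (n,)
```

Every single query has to touch all N keys, so both the time and the memory bills scale linearly with the size of the memory.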