Memory Layers Are Supercharging LLMs Like Never Before
A deep dive into how Memory layers work and how they supercharge LLMs, so much so that next-generation AI architectures will miss out if they do not use them.
LLMs store massive amounts of knowledge within their parameters (primarily as the weights of the linear matrix transforms in their dense layers).
However, as the parameter size grows, so does the computational and energy cost.
Can these dense layers be replaced with simple and cheap key-value lookup mechanisms?
Much previous research has explored this idea, but never at the scale of today's AI architectures.
Meta researchers have now figured it out, developing Memory layers that can supercharge our current LLMs.
These layers replace the Feed-forward network (FFN) of one or more Transformer layers.
And the results are surprisingly good!

Memory layers improve the factual accuracy of LLMs by over 100%, along with improvements in coding performance and general knowledge comparable to conventional LLMs trained with 4X more compute!
These memory-layer-augmented LLMs also beat Mixture-of-Experts architectures trained with similar compute and parameter counts, especially on factual tasks.
In this deep dive, we explore how Memory layers work and how they supercharge LLMs, so much so that next-generation AI architectures will miss out if they do not use them.
What Are Memory Layers?
Memory layers work similarly to the Attention mechanism in a Transformer.
Given a Query (Q), Keys (K) and Values (V), they output a weighted sum of the Values, where the weights are determined by the similarity between the query and the keys using a softmax function.
However, two major differences separate memory layers from conventional Attention.
First — unlike the Attention mechanism (where keys and values are computed dynamically for each query), keys and values are trainable parameters in memory layers that are learned and stored persistently.
Second — the number of key-value pairs used in Memory layers is huge (in millions).
Only the top-k most similar keys and corresponding values are used to calculate the output, making the lookup and updates computationally efficient at this scale.
A Memory layer can be described using the following equations:

$$I = \mathrm{SelectTopkIndices}(Kq), \qquad s = \mathrm{Softmax}(K_I\, q), \qquad y = s\, V_I$$

The indices ($I$) of the top-k keys are first selected based on their similarity to the query ($q$).
The similarity scores ($K_I q$) of the selected keys are then normalized using a Softmax to obtain the weights ($s$).
Finally, the output ($y$) is calculated as the weighted sum of the top-k selected values ($V_I$).
Each token embedding passes through a Memory layer independently, similar to the Feed-forward layers in a conventional Transformer.
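To make this concrete, here is a minimal PyTorch sketch of a memory layer forward pass following the equations above. The class and parameter names (SimpleMemoryLayer, num_keys, top_k) are my own illustrative choices, and scoring the query against every key is a naive simplification rather than the optimized lookup used at real scale, which the next section turns to.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    """A minimal (non-optimized) memory layer: trainable keys/values + top-k lookup."""

    def __init__(self, dim: int, num_keys: int, top_k: int = 32):
        super().__init__()
        self.top_k = top_k
        # Unlike attention, keys and values are persistent, trainable parameters.
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_keys, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -- each token embedding is processed independently.
        scores = x @ self.keys.T                               # naive scoring against every key
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # I = SelectTopkIndices(Kq)
        weights = F.softmax(top_scores, dim=-1)                # s = Softmax(K_I q)
        top_values = self.values[top_idx]                      # (batch, seq_len, top_k, dim)
        return (weights.unsqueeze(-1) * top_values).sum(dim=-2)  # y = s V_I

# Hypothetical usage: drop in place of the FFN block of a Transformer layer.
# num_keys is kept small here to stay lightweight; real memory layers use millions of keys.
layer = SimpleMemoryLayer(dim=512, num_keys=65_536, top_k=32)
tokens = torch.randn(2, 16, 512)
out = layer(tokens)  # (2, 16, 512)
```

Note how, unlike attention, the keys and values here are parameters that persist across inputs and are learned during training.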
How Are Similar Keys Searched At Scale?
It is computationally expensive to find the most similar keys for a query.
A naive nearest-neighbour search will:
1. compute a similarity score (such as cosine similarity) between the query and every key (time complexity: O(N ⋅ n) for N keys of dimensionality n)
2. sort the keys by their similarity scores (time complexity: O(N log N) for N keys)
3. select the top-k keys with the highest similarity scores
4. calculate the final output using the top-k keys and their values

The memory cost of this approach is also O(N ⋅ n) for N keys of dimensionality n.
This is infeasible, given there are millions of keys.
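To see where the cost comes from, here is a NumPy sketch of the naive search. The sizes (N, n, top_k) and array names are illustrative assumptions rather than values from the paper, and the similarity here is a plain dot product.

```python
import numpy as np

# Illustrative sizes (assumptions); real memory layers use millions of keys,
# which is exactly what makes this naive approach infeasible.
N, n, top_k = 100_000, 512, 32

keys = np.random.randn(N, n).astype(np.float32)    # O(N * n) memory just to store the keys
values = np.random.randn(N, n).astype(np.float32)
query = np.random.randn(n).astype(np.float32)

# 1. Score the query against every key: O(N * n) time.
scores = keys @ query

# 2. Rank all keys by similarity: O(N log N) for a full sort.
order = np.argsort(-scores)

# 3. Select the top-k most similar keys.
top_idx = order[:top_k]

# 4. Softmax the top-k scores and take the weighted sum of the selected values.
weights = np.exp(scores[top_idx] - scores[top_idx].max())
weights /= weights.sum()
output = weights @ values[top_idx]                 # shape: (n,)
```

Every single query has to touch all N keys, so both the time and the memory bills scale linearly with the size of the memory.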