A 27M Hierarchical Reasoning Model Beats OpenAI's 'o3-mini-high'
A deep dive into the Hierarchical Reasoning Model to understand the internals that help it outperform the powerful reasoning models available to us today.
Reasoning well is one of the biggest challenges for AI models available today.
Most popular LLMs use Chain-of-Thought (CoT) prompting and inference-time scaling for reasoning, but they still aren’t good enough.
On top of their imperfect reasoning, these models suffer from very high latency and are too expensive for everyday use.
Check out the performance of today’s most powerful LLMs on the ARC-AGI benchmarks, which contain tasks that are easy for humans to solve, yet hard, or even impossible, for AI.

But this is about to change.
A small Singapore-based AI lab, founded in 2024, called Sapient Intelligence, has just open-sourced and published a new AI architecture called Hierarchical Reasoning Model (HRM) that has shocked the AI community.
The HRM architecture is inspired by the human brain and uses two interdependent recurrent networks:
a slower, high-level module for abstract, deliberate reasoning (the “Controller” module)
a faster, low-level module for detailed computations (the “Worker” module)

With just 27 million parameters and only 1,000 training samples, an HRM achieves nearly perfect performance on complex Sudoku puzzles and optimal pathfinding in large mazes.
In comparison, o3-mini-high, Claude 3.7 8K, and DeepSeek-R1 all score zero accuracy on these tasks.
Alongside this, an HRM outperforms all of these models on the ARC-AGI-1 and ARC-AGI-2 benchmarks, working directly from the inputs without any pre-training or CoT data.

In this story, we take a deep dive into how the Hierarchical Reasoning Model works and unpack the internals that help it outperform the powerful reasoning models available to us today.
But First, Why Can’t Our Current LLMs Reason Well?
Deep neural networks are the backbone of all the Artificial Intelligence popularly available to us today.
They operate on a fundamental principle: the deeper a network (the more layers it has), the better it performs.
The Transformer, the most successful architecture and the one that powers all our LLMs, is itself a deep neural network built on this principle.
However, there’s a problem: an LLM’s architecture is fixed, and its depth doesn't grow with the complexity of the problem being solved.
This fixed depth makes LLMs unable to solve problems that require polynomial-time computation.
LLMs also aren’t Turing complete.
(A Turing-complete system can perform any computation that can be described by an algorithm, as long as it has enough time and memory.)
To work around these limitations, LLMs rely on Chain-of-Thought (CoT) prompting, a technique that breaks complex tasks down into simpler intermediate steps before solving them.
However, CoT prompting performs reasoning in human language, which is not how humans do it: for us, language is primarily a tool for communication rather than for reasoning or thought.
It also makes the process brittle, since a single misstep can derail the reasoning completely.
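For instance, a CoT prompt (a hypothetical toy example, not from the paper) looks something like this:

```python
# A toy Chain-of-Thought prompt (hypothetical illustration):
prompt = (
    "Q: A train travels 60 km in 1.5 hours. What is its average speed?\n"
    "A: Let's think step by step.\n"
    "1. Average speed = distance / time.\n"
    "2. 60 km / 1.5 h = 40 km/h.\n"
    "So the answer is 40 km/h."
)
```

Every one of those intermediate steps is generated as tokens, and every step must be right for the final answer to be right.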
Furthermore, training reasoning LLMs requires massive amounts of long CoT data, which makes the process expensive and raises the concern that we may run out of data to train future LLMs on.
Alongside this, generating the many tokens needed for complex reasoning tasks leads to slow responses and heavy computational costs at inference (test) time.
What Can We Learn About Reasoning From The Human Brain?
While LLMs use explicit natural language for reasoning, humans reason in a latent space without constant translation back and forth to language.
Following this insight, researchers from Meta published a technique called Chain of Continuous Thought (Coconut) in 2024, which outperformed CoT in many logical reasoning tasks while using fewer reasoning tokens during inference.
Since then, many such techniques have been introduced, but they all suffer from a limitation.
The LLMs trained for latent reasoning aren’t deep enough, because stacking more and more layers leads to vanishing gradients, which stalls learning altogether.
These models also learn through backpropagation through time (BPTT), which is at odds with what research tells us about how the human brain learns.
The next natural question from here is: So, how does the human brain really learn and reason?
We do not have a complete answer to this question, but we know that the brain is structured in layers or different levels, and these levels process information at different speeds.
Low-level regions react quickly to sensory inputs (like vision) and drive movement, while high-level regions integrate information over longer timescales to perform slow computations, like abstract planning.
The slow, higher-level areas guide the fast, lower-level circuits that then execute a task. This is reflected in brain waves of different frequencies (slow theta waves and fast gamma waves).
Both areas also use feedback loops that help refine thoughts, change decisions, and learn from experience.
This hierarchical structure gives the brain sufficient “computational depth” for solving challenging tasks.
Could we borrow these concepts and create an AI architecture that can replicate what we know about how the human brain works?
Here Comes ‘Hierarchical Reasoning Model’
Inspired by the human brain, the Hierarchical Reasoning Model (HRM) architecture consists of four components:
Input network (f_I)
Low-level recurrent module (L-module, or f_L, the “Worker” module)
High-level recurrent module (H-module, or f_H, the “Controller” module)
Output network (f_O)
An HRM performs reasoning over N high-level cycles, each containing T low-level timesteps, making the total number of timesteps per forward pass N × T.
The modules f_L and f_H each keep a hidden state, z_L^i and z_H^i respectively (where i is the current timestep), initialized as z_L^0 and z_H^0.
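To make this concrete, here is a minimal sketch of the four components in PyTorch. The actual HRM implements its modules as Transformer-style blocks; the GRU cells, dimensions, and variable names below are simplifying assumptions for illustration, not the paper’s code.

```python
import torch
import torch.nn as nn

D_IN, D, D_OUT = 64, 256, 10   # illustrative input, hidden, and output dims
N, T = 4, 8                    # N high-level cycles, T low-level timesteps each

f_I = nn.Linear(D_IN, D)       # input network: projects x into x̃
f_L = nn.GRUCell(2 * D, D)     # L-module ("Worker"): reads x̃ and z_H
f_H = nn.GRUCell(D, D)         # H-module ("Controller"): reads z_L
f_O = nn.Linear(D, D_OUT)      # output network: reads the prediction off z_H

z_L = torch.zeros(1, D)        # z_L^0, the L-module's initial hidden state
z_H = torch.zeros(1, D)        # z_H^0, the H-module's initial hidden state
```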
Forward Pass In An HRM
Given an input x, it is first projected into a representation x̃ by the input network f_I.
At each timestep i, the low-level module (L-module) updates its state based on:
its previous state z_L^(i-1)
the high-level module’s current state z_H^(i-1) (which remains fixed throughout a cycle), and
the input representation x̃
The high-level module (H-module) only updates once at the end of each cycle, i.e., every T timesteps, using the L-module’s final state at the end of that cycle.
After N full cycles (or N × T timesteps), a final prediction ŷ is extracted from the hidden state of the high-level module using the output network f_O.
A halting mechanism (which we will discuss later) decides whether the model should end its processing and use ŷ as the final prediction, or proceed with another forward pass.
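Putting this schedule together, a single forward pass can be sketched as below, continuing the illustrative modules defined earlier (the halting mechanism is left out):

```python
def hrm_forward(x, z_L, z_H):
    x_tilde = f_I(x)                    # project the input once
    for n in range(N):                  # N high-level cycles
        for t in range(T):              # T low-level timesteps per cycle
            # L-module update: previous z_L, the cycle's fixed z_H, and x̃
            z_L = f_L(torch.cat([x_tilde, z_H], dim=-1), z_L)
        # H-module updates once per cycle, from the L-module's final state
        z_H = f_H(z_L, z_H)
    y_hat = f_O(z_H)                    # prediction read off the H-module
    return y_hat, z_L, z_H

y_hat, z_L, z_H = hrm_forward(torch.randn(1, D_IN), z_L, z_H)
```

Note how z_H stays frozen inside the inner loop and changes only between cycles; this is the “Controller directs, Worker executes” split described above.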
Alongside the halting mechanism, the HRM uses many specialized techniques that enable it to perform so well. Let’s discuss them next.
Hierarchical Convergence To Fix The Problem Of Early Convergence
Recurrent networks suffer from the problem of early convergence.
As their hidden state settles towards a fixed point, update magnitudes shrink, and the computation performed at each subsequent step dwindles to almost nothing.
The researchers’ goal was therefore to slow this convergence process so that the HRM can keep computing effectively.
This is done using a process called Hierarchical Convergence, which works as follows.
During each cycle of T timesteps, the L-module converges towards a local equilibrium based on the (fixed) state of the H-module.
After a cycle is complete, the H-module performs its own update based on the final state of the L-module.
In the next cycle, the L-module uses this updated state of the H-module to reach a different local equilibrium. (Think of this as the L-module resetting with each cycle guided by the H-module.)
This process allows the H-module to function as a “Controller” that directs the overall problem-solving strategy, while the L-module functions as a “Worker” executing the refinement required at each step.
If this were a normal recurrent network, it would have converged within T timesteps. Instead, an HRM achieves stable and prolonged convergence over N × T timesteps, leading to better performance.
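One way to observe this behaviour is to track the magnitude of the L-module’s update at every timestep. Continuing our illustrative sketch: for a trained HRM, the residuals recorded below would be expected to shrink within each cycle and spike right after every H-module update.

```python
residuals = []
z_L, z_H = torch.zeros(1, D), torch.zeros(1, D)
x_tilde = f_I(torch.randn(1, D_IN))
for n in range(N):
    for t in range(T):
        z_L_next = f_L(torch.cat([x_tilde, z_H], dim=-1), z_L)
        residuals.append((z_L_next - z_L).norm().item())  # update magnitude
        z_L = z_L_next
    z_H = f_H(z_L, z_H)  # the "reset" that starts a new local equilibrium
print(residuals)  # expect decay within each cycle, spikes after H-updates
```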

PCA (Principal Component Analysis) is a technique that finds the directions (called principal components) along which a model’s hidden states vary the most.
The PCA analysis on different models is shown below.
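If you want to run such an analysis yourself, a basic version looks like the sketch below: collect a model’s hidden states during a forward pass and project them onto their top principal components (using scikit-learn’s PCA, with a random placeholder standing in for real hidden states).

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder: swap in real hidden states of shape (timesteps, hidden_dim)
hidden_states = np.random.randn(32, 256)

pca = PCA(n_components=2)
trajectory = pca.fit_transform(hidden_states)  # states in the top-2 PC plane
print(pca.explained_variance_ratio_)           # variance captured per component
```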

Avoiding Backpropagation By Approximating Gradients In One Step
Standard recurrent networks use backpropagation through time to compute gradients, which requires them to store every hidden state from the forward pass and then combine these with gradients during the backward pass. This is memory-hungry, and the cost grows with the number of timesteps.
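Before the details, here is a minimal sketch of what a one-step gradient approximation can look like, assuming, as the heading suggests, that gradients flow only through the final L- and H-updates while all earlier timesteps run without gradient tracking. This is an illustrative reconstruction built on the earlier snippets, not the authors’ exact implementation.

```python
z_L, z_H = torch.zeros(1, D), torch.zeros(1, D)
x_tilde = f_I(torch.randn(1, D_IN))

# Run all but the final L- and H-updates without storing activations:
with torch.no_grad():
    for n in range(N):
        for t in range(T):
            if n == N - 1 and t == T - 1:
                continue                 # leave the very last L-update for later
            z_L = f_L(torch.cat([x_tilde, z_H], dim=-1), z_L)
        if n < N - 1:                    # leave the last H-update for later
            z_H = f_H(z_L, z_H)

# Recompute only the final updates with gradients, so backprop is one step deep:
z_L = f_L(torch.cat([x_tilde, z_H], dim=-1), z_L)
z_H = f_H(z_L, z_H)
loss = nn.functional.mse_loss(f_O(z_H), torch.zeros(1, D_OUT))  # dummy target
loss.backward()   # memory stays constant in the number of timesteps
```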