LLMs Can Now Self-Evolve At Test Time Using Reinforcement Learning
A deep dive into Test-Time Reinforcement Learning (TTRL), a technique that allows LLMs to learn from test-time data using RL without ground-truth labels.
Large Reasoning Models (LRMs) are trained on vast amounts of long Chain-of-Thought (CoT) reasoning data and spend more time at inference to reason better.
But what if an LLM could be trained to reason better during test/inference time itself?
Researchers from Tsinghua University and Shanghai AI Lab have now made this a reality.
They have just introduced a new technique called Test-Time Reinforcement Learning (TTRL) in their recent research paper.
This technique enables LLMs to improve themselves during inference using unlabelled test data, through Reinforcement Learning (RL).
TTRL is highly effective in improving the performance of LLMs on multiple tasks.
Experiments show that it boosts the pass@1 accuracy of Qwen-2.5-Math-7B on AIME 2024 by 211%!
Alongside this, TTRL-trained LLMs nearly match the performance of those trained directly on test data with ground-truth labels using RL.
Here is a story that takes a deep dive into Test-Time Reinforcement Learning (TTRL), exploring how it works and how effective it is in improving the performance of LLMs and LRMs.
Let’s begin!
Let’s Start With LRMs
LRMs, or Large Reasoning Models, are LLMs trained specifically to use reasoning to solve complex problems.

These models are trained on human-annotated data to reason better (using SFT/RL), which gives them the ability to use long CoT reasoning at inference time.
This enables them to perform remarkably well on challenging problems, such as in the benchmarks Humanity’s Last Exam and ARC-AGI.
However, as these benchmarks are updated with newer and more complex problems, the performance of LRMs declines as they struggle to solve these increasingly challenging questions.
Check out the plot below, which shows how OpenAI’s o3 achieves 60.8% accuracy on ARC-AGI-1 as compared to just 6.5% on the updated ARC-AGI-2.
A similar accuracy drop is seen for OpenAI’s o4-mini.
What if there were a way for LLMs/LRMs to adapt to test/inference-time data to understand it better before solving it?
A method for achieving this adaptation has been studied in previous research, and it is known as Test-Time Training (TTT).
Let’s discuss it next.
Training At Test Time? Yes!
Traditional LLMs, once trained, keep their parameters frozen when evaluated on test data.
Test-time Training (TTT) is different from this.
It involves adapting an LLM’s parameters at test time based on the structure and distributional properties of the test data.
This is achieved using self-supervised learning on examples synthesised from the test data and fine-tuning lightweight LoRA adapters.
This enables the LLM to handle distribution shifts, out-of-domain examples, and few-shot scenarios well.
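To make this concrete, here is a minimal, hypothetical sketch of TTT with LoRA adapters in Python, assuming a Hugging Face causal LM and the peft library. The model name, hyperparameters, and the make_synthetic_examples helper are illustrative placeholders, not the exact setup used in the research discussed below.

```python
# Hypothetical sketch of Test-Time Training (TTT) with LoRA adapters.
# Model name, hyperparameters, and make_synthetic_examples are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Attach lightweight LoRA adapters; only these parameters are updated at test time.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def make_synthetic_examples(test_prompt: str) -> list[str]:
    """Hypothetical helper: build self-supervised variants of the test input
    (e.g., augmented or leave-one-out versions of its in-context examples)."""
    return [test_prompt]  # trivially reuse the prompt itself in this sketch

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
model.train()
for text in make_synthetic_examples("<your test-time input here>"):
    batch = tokenizer(text, return_tensors="pt")
    # Standard next-token (causal LM) loss on the synthesized example.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```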
A previous research paper titled ‘The Surprising Effectiveness of Test-Time Training for Few-Shot Learning’ shows how effective TTT is at helping LLMs solve tough problems from the ARC and BIG-Bench Hard (BBH) benchmarks.

TTT, as described in this research paper, boosts the performance of a fine-tuned 8B Llama 3 model on ARC by 27.5% and increases its accuracy on BBH by 7.3%.

How about we extend TTT and make its gains more permanent?
Here Comes Test-Time Reinforcement Learning
Training data is labelled or verifiable. This provides the ground truth (for supervised learning) and reward signals (for RL) during LLM training.
Test-time data, on the other hand, is unlabelled and largely unverifiable (except for programming/mathematical tasks).
Test-Time Reinforcement Learning (TTRL) solves this problem: it trains a pre-trained LLM at test time using RL, without ground-truth labels.
Let’s understand how it works.
Given a prompt or test-time input x (called the ‘State’ in RL terms), an LLM π with parameters θ (called the ‘Policy’ in RL terms) produces an output y (called the ‘Action’ in RL terms).
The secret sauce of TTRL lies in producing a reward signal without ground-truth labels.
For this, the LLM first produces multiple candidate outputs y(1), y(2), …, y(N) for a given prompt. It then selects a consensus output y* from these candidates, which acts as a pseudo-label, i.e., a proxy for the optimal output/action.
This selection is done using a scoring function as follows:

The scoring function can be implemented using different techniques (a code sketch follows this list), such as:
Majority Voting: select the most frequent output among the candidates
Best-of-N: use a reward model to score each candidate and select the highest-scoring one
Weighted Best-of-N: combine each candidate’s frequency (from majority voting) and quality score (from a reward model) in a weighted sum and select the best candidate
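To make the scoring step concrete, here is a minimal Python sketch of these three strategies (my own illustration, not the authors’ code), assuming the final answers have already been extracted from the candidate outputs and that reward-model scores are given:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Majority Voting: pick the most frequent candidate answer as y*."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers: list[str], scores: list[float]) -> str:
    """Best-of-N: pick the candidate with the highest reward-model score."""
    return max(zip(answers, scores), key=lambda pair: pair[1])[0]

def weighted_best_of_n(answers: list[str], scores: list[float], alpha: float = 0.5) -> str:
    """Weighted Best-of-N: weighted sum of each answer's frequency and its best reward-model score."""
    freq = Counter(answers)
    n = len(answers)
    best_score: dict[str, float] = {}
    for a, s in zip(answers, scores):
        best_score[a] = max(best_score.get(a, s), s)
    return max(best_score, key=lambda a: alpha * freq[a] / n + (1 - alpha) * best_score[a])

# Example: five sampled answers to the same prompt (scores are hypothetical).
answers = ["42", "42", "41", "42", "7"]
scores = [0.9, 0.8, 0.95, 0.7, 0.1]
y_star = majority_vote(answers)  # -> "42"
```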
A reward is then calculated based on how well an output/action y aligns with the consensus output/action y*. This is done using a reward function r(y, y*) that returns a reward of 1 if the model’s output y matches the consensus output y*, and 0 otherwise.
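In code, this rule-based reward is a one-liner (again an illustrative sketch, not the paper’s implementation):

```python
def ttrl_reward(y: str, y_star: str) -> float:
    """Return 1 if the sampled output matches the consensus pseudo-label, else 0."""
    return 1.0 if y == y_star else 0.0

y_star = "42"  # consensus answer obtained from majority voting
rewards = [ttrl_reward(y, y_star) for y in ["42", "41", "42"]]  # -> [1.0, 0.0, 1.0]
```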
Finally, the LLM is trained to maximize the expected reward over its outputs:
max_θ  E_{y ∼ π_θ(· | x)} [ r(y, y*) ]
This is done by updating the LLM parameters through gradient ascent on this objective (policy optimization with GRPO).
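To connect the pieces, here is a simplified sketch of how the binary consensus rewards could feed a GRPO-style update: rewards are normalized within the group of N sampled outputs to obtain advantages, which then weight the policy-gradient (gradient-ascent) step. This is my own illustration of the idea under these assumptions, not the authors’ implementation.

```python
import statistics

def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize rewards within the group of N samples."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Binary consensus rewards for N = 5 sampled outputs of one prompt.
rewards = [1.0, 1.0, 0.0, 1.0, 0.0]
advantages = grpo_advantages(rewards)
# In a full implementation, each sampled output's token log-probabilities would be
# weighted by its advantage, and the policy parameters theta updated by gradient
# ascent on the resulting objective (e.g., loss = -(advantage * logprob)).
```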
The following illustration shows the complete TTRL process.

Does TTRL Really Work Well?
Based on the above methodology, the researchers apply TTRL to multiple LLMs:
LLMs from the Qwen family
LLMs from the LLaMA family
LLMs from the DeepSeek family
LLMs from the Mistral family
These LLMs are then evaluated before and after applying TTRL on:
GPQA-Diamond: a set of 198 extremely difficult, expert-validated MCQs in biology, chemistry, and physics
3 mathematical reasoning benchmarks, namely, AIME 2024, AMC, and MATH-500
The primary metric used to evaluate the LLMs on each benchmark is pass@1 accuracy, which measures whether the LLM’s first (top) response to a problem is correct.
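As a simple illustration of the metric (not code from the paper), pass@1 over a set of problems can be computed as:

```python
def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """pass@1: fraction of problems where the model's single top answer is correct."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(pass_at_1(["42", "7", "3.14"], ["42", "8", "3.14"]))  # -> 0.666...
```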

Here is what the results show!
TTRL Consistently Improves All LLMs On Most Tasks
TTRL improves the performance of all LLMs across the four highly challenging benchmarks.
On AIME 2024, TTRL achieves at least a 105% improvement across all LLMs.
Notably, applying it to Qwen2.5-Math-1.5B, a 1.5B-parameter LLM, improves its accuracy on the MATH-500 benchmark by 40.3%!
Moreover, despite being trained on unlabelled data with self-generated rewards, TTRL-trained models match or exceed the performance of models RL-trained on fully labelled datasets.
TTRL Even Boosts The Performance Of Strong LRMs
Even LRMs that have already been extensively trained on massive reasoning datasets get significantly better, showing an accuracy jump of around 10% after TTRL.
TTRL Works Better As Models Get Bigger
As model size (parameter count) increases, the performance gains from TTRL consistently grow.
This is because larger models are better at generating accurate answers when using majority voting, which helps them learn more effectively from their own outputs using TTRL.
Performance Gains After TTRL Aren’t Merely Task Specific Or RL Algorithm Specific
TTRL doesn’t just improve LLM performance on the specific task it was applied to; it also helps on other (out-of-distribution) tasks the model wasn’t trained for.
As shown in the image below, Qwen2.5-Math-7B demonstrates generalizable gains when trained with TTRL on one benchmark and then evaluated on others.
Similarly, TTRL applied using three different RL algorithms (GRPO, PPO, and PRIME) leads to similar accuracy gains/ entropy drops on MATH-500, indicating that it is algorithm-agnostic.

What Does This Mean For The Future Of LLMs?
Labelled data is expensive and requires a significant amount of time and human effort to assemble.
With more unlabelled data available than labelled data, TTRL represents a massive leap in training LLMs, enabling them to self-evolve with experience rather than relying on human supervision.

We are rapidly transitioning towards the “Era of Experience”, and techniques like TTRL are helping us take this step towards superhuman intelligence.
What are your thoughts on this? Please share them with me in the comments below.
Further Reading
Research paper titled ‘TTRL: Test-Time Reinforcement Learning’, published on arXiv
GitHub repository associated with the original TTRL research paper
Source Of Images
All images used in the article are created by the author or obtained from the original research paper unless stated otherwise.