Into AI

Learn To Train A Reasoning Model From Scratch Using GRPO

Train the open-source Qwen into a reasoning model using GRPO with Unsloth, which handles all the complicated heavy lifting for you. (10-minute read)

Dr. Ashish Bamania
Sep 19, 2025
Image generated with Google ImageFX

Reasoning LLMs (also called Large Reasoning Models) are quite popular these days.

These are LLMs that reason through a problem step by step before producing an answer.

While most early research on training Reasoning LLMs has been kept a “trade secret”, many recent projects have laid this process out in the open (DeepSeek-R1, DeepSeekMath, Kimi-k1.5, and DAPO).

These approaches train LLMs to generate long Chain-of-Thought outputs at inference time, which helps them reason better.

They also introduce modified RL algorithms, such as GRPO and DAPO, which are more efficient successors to the original PPO algorithm developed at OpenAI.
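
As a preview of the core idea (following the formulation in the DeepSeekMath paper, not this excerpt): instead of training a separate value network as a baseline the way PPO does, GRPO samples a group of G completions for the same prompt, scores them with a reward function, and normalises each reward against its own group:

```
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
```

The policy is then updated with a PPO-style clipped objective using these group-relative advantages, together with a KL penalty that keeps it close to a reference model.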

Large reasoning models visualised (Image from author’s book titled ‘LLMs In 100 Images’)

In this lesson, we will learn about the basics of GRPO (Group Relative Policy Optimization), a popular RL algorithm for training Reasoning LLMs.
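
To make the “group-relative” part concrete before we dive in, here is a tiny, self-contained sketch (with illustrative numbers, not taken from the training run we will build): each completion's reward is normalised against the mean and standard deviation of its own group.

```python
# Group-relative advantage, the heart of GRPO (toy numbers for illustration).
import statistics

# Hypothetical rewards for 4 completions sampled for the same prompt.
rewards = [1.0, 0.0, 0.5, 1.0]

mean_r = statistics.mean(rewards)
std_r = statistics.pstdev(rewards) or 1.0  # guard against a zero std

advantages = [(r - mean_r) / std_r for r in rewards]
print(advantages)  # completions above the group average get positive advantages
```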

Then, we will get our hands dirty and write the code to train a reasoning LLM, learning the process thoroughly and without any jargon.
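
To give you a flavour of where we are heading, here is a condensed sketch of a GRPO training setup built on Unsloth and TRL's GRPOTrainer. The model name, toy dataset, reward function, and hyperparameters below are illustrative placeholders rather than the exact code we will write step by step.

```python
# A minimal GRPO training sketch with Unsloth + TRL (illustrative, not the full lesson code).
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Load a small Qwen checkpoint in 4-bit and attach LoRA adapters for training.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",  # assumed checkpoint; swap in your own
    max_seq_length=1024,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# A toy dataset: each row has a prompt and a reference answer.
dataset = Dataset.from_list([
    {"prompt": "What is 7 * 6? Think step by step.", "answer": "42"},
    {"prompt": "What is 15 + 27? Think step by step.", "answer": "42"},
])

def correctness_reward(prompts, completions, answer, **kwargs):
    # Toy reward: 1.0 if the completion contains the reference answer, else 0.0.
    return [1.0 if a in c else 0.0 for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward],
    args=GRPOConfig(
        num_generations=4,              # group size G: completions sampled per prompt
        per_device_train_batch_size=4,  # must be a multiple of num_generations
        max_completion_length=512,      # room for the chain-of-thought
        learning_rate=5e-6,
        max_steps=100,
        output_dir="qwen-grpo",
    ),
    train_dataset=dataset,
)
trainer.train()
```

The interesting choices live in the reward functions: GRPO does not need labelled chains of thought, only a way to score each sampled completion.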


Before we begin, I want to introduce you to my new book called “LLMs In 100 Images”.

It is a collection of 100 easy-to-follow visuals that describe the most important concepts you need to master LLMs today.

Grab your copy today at a special early bird discount using this link.
