DeepSeek-R1 Beats OpenAI's o1, Revealing All Its Training Secrets Out In The Open
A deep dive into how DeepSeek-R1 was trained from scratch and how this open-source research will accelerate AI progress like never before.
It’s incredible to see how far AI has progressed in the last decade.
Most of this progress came after Google released their groundbreaking paper called “Attention Is All You Need” in 2017.
Other companies then built on the ideas discussed in this paper (the Transformer architecture) to create powerful LLMs.
One of these companies, OpenAI, had been heavily focused on reinforcement learning (RL) before Transformers arrived; it then shifted its trajectory towards LLMs, benefiting enormously from Google’s open-sourced research.
Although OpenAI started as an organisation to democratise AI, it ended up making its research and products proprietary.
The last open-source model released by OpenAI was GPT-2, which was made publicly available in November 2019.
Since then, all their model advancements have been kept a secret.
With the release of ‘o1’ a few months ago, OpenAI appears to have discovered something phenomenal: training its newer LLMs to spend more time thinking (with long Chain-of-Thought and reinforcement learning) when solving complex problems.
What are the exact “hows” of this process? No one outside OpenAI knows.
This changes now.
DeepSeek, a company based in China, has just released ‘DeepSeek-R1’, their latest model that is capable of reasoning in complex domains just like OpenAI’s o1.
This LLM is so good that it outperforms o1 on multiple mathematical and coding benchmarks.
Even smaller LLMs distilled from DeepSeek-R1 outperform models like OpenAI’s o1-mini and GPT-4o, and Anthropic’s Claude 3.5 Sonnet.
In this story, we deep-dive into how DeepSeek-R1 was trained and how these findings will accelerate AI progress like never before.
Let’s begin!
But First, How Are LLMs Usually Trained?
LLM training starts with gathering a large amount of text data.
This data comes from publicly available sources on the web or, at times, from proprietary sources.

This data is cleaned, formatted, and tokenized, and the tokens are mapped to embeddings that the model can process.
Next, an LLM is trained on this unlabeled data through self-supervised learning using powerful GPUs/TPUs.
This step is termed Pre-training, and it teaches the LLM the structure of a language (grammar, semantics, and contextual relationships).
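To make this concrete, here is a minimal sketch of the next-token-prediction objective behind pre-training, written in PyTorch. It is purely illustrative (not DeepSeek’s training code), and `model` is assumed to be any causal LM that maps token IDs to vocabulary logits:

```python
import torch.nn.functional as F

def pretraining_loss(model, token_ids):
    """Self-supervised pre-training: predict every token from the
    tokens that come before it (next-token prediction)."""
    inputs = token_ids[:, :-1]        # all tokens except the last
    targets = token_ids[:, 1:]        # the same sequence shifted left by one
    logits = model(inputs)            # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and sequence dims
        targets.reshape(-1),
    )
```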
A pretrained LLM is then Supervised Fine-tuned (SFT) using relevant datasets to increase its performance on a specific task/domain (mathematical reasoning, coding, machine translation, and more).
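Supervised fine-tuning reuses essentially the same objective, except the loss is typically restricted to the response tokens of each labelled example. A rough sketch, with `response_mask` as an assumed input marking which positions belong to the response:

```python
import torch.nn.functional as F

def sft_loss(model, token_ids, response_mask):
    """Supervised fine-tuning: the same next-token objective,
    but only response tokens (not prompt tokens) contribute to the loss."""
    logits = model(token_ids[:, :-1])            # (batch, seq_len - 1, vocab_size)
    targets = token_ids[:, 1:]
    mask = response_mask[:, 1:].float()          # 1.0 where the target is a response token
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    return (per_token * mask.reshape(-1)).sum() / mask.sum()
```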
The supervised fine-tuned LLM is then aligned with human preferences to prevent harmful response generation, most popularly using Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO).
Reinforcement learning (RL) has also been used to improve Chain-of-thought reasoning in LLMs, as per an OpenAI blog describing o1.
And it is the secret sauce behind much of what DeepSeek-R1 does so well.
Let’s learn how.
Eliminating SFT With Reinforcement Learning
The DeepSeek team starts its experiments with DeepSeek-V3-Base as the pre-trained base model.
Next, instead of using SFT, they directly train it with RL to improve its reasoning performance.
This allows the model to develop its reasoning capabilities without any supervised data, self-evolving in the process.
Instead of the popular policy optimization algorithm called Proximal Policy Optimization (PPO) developed at OpenAI, they use an algorithm developed in-house called Group Relative Policy Optimization (GRPO).
The following section describes how they differ.
PPO vs. GRPO
In PPO, a policy model works alongside a critic/value model to compute advantages A_t, using Generalized Advantage Estimation (GAE).
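As a rough sketch of what that involves (illustrative, not any particular library’s implementation), GAE blends the temporal-difference errors computed from the critic’s value estimates into a single advantage per step; `gamma` and `lam` are the usual discount and smoothing parameters:

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: blend per-step TD errors
    (computed with the critic's value estimates) into advantages A_t."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD error at step t
        gae = delta + gamma * lam * gae                       # recursive smoothing
        advantages[t] = gae
    return advantages
```

Notice that this needs a value estimate from a separately trained critic at every step, which is exactly the component GRPO does away with.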
GRPO removes the critic/value model and computes advantages based on the relative rewards of a group of sampled outputs.
This reduces the computational complexity and training costs involved with RL.
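Concretely, the group-relative advantage needs nothing more than the scalar rewards of the outputs sampled for the same prompt. A minimal sketch, normalising each reward by the group’s mean and standard deviation as in the GRPO formulation:

```python
import statistics

def grpo_advantages(group_rewards):
    """GRPO advantage: score each sampled output relative to the other
    outputs drawn for the same prompt, with no critic/value model."""
    mean_r = statistics.mean(group_rewards)
    std_r = statistics.pstdev(group_rewards) or 1.0   # guard against zero spread
    return [(r - mean_r) / std_r for r in group_rewards]

# Example: 4 sampled answers to one prompt, scored by a rule-based reward
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # correct answers get positive advantages
```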