A Detailed Guide To Reinforcement Learning From Human Feedback (RLHF) From Scratch
A deep dive into training an LLM and using Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) to align it with human values.
GPT-based LLMs are next-token predictors, and they do their job pretty well.
However, these models exhibit unintended behaviours after Pre-training (the initial training of the model on a huge amount of data).
The first issue is that they do not follow human instructions well.
This can be seen in the following example, where a pre-trained model simply reiterates the instructions in its response.
human_instruction:
"What is the capital of India?"
response_from_pretrained_gpt:
"What is the capital of India? What is capital"
This can be fixed by fine-tuning the model to follow instructions.
Here is how the response from a GPT model might change after fine-tuning it.
human_instruction:
"What is the capital of India?"
response_from_instruction_finetuned_gpt:
"New Delhi"
But the problems do not end here.
The model can make up facts —
response_from_instruction_finetuned_gpt:
"Mumbai"
The model can generate biased or toxic text —
response_from_instruction_finetuned_gpt:
"Isn't it obviously New Delhi?"
Or, the model might choose not to reply at all.
This behaviour arises because LLMs are trained with the objective of accurately predicting the next token, based on the probability distribution they have learned from large datasets of Internet text.
This objective is far from one of following the user’s instructions to respond helpfully and safely.
In other words, one can say that LLMs are misaligned with this objective.
In 2022, researchers at OpenAI published a useful technique to fix this and ‘align’ GPTs to produce helpful, honest, and harmless responses.
This is called Reinforcement learning from human feedback (RLHF).
OpenAI and Google DeepMind researchers previously used a similar technique for robot locomotion and to play Atari games, achieving remarkable results.

When they used this technique to preference-tune GPT-3 to follow a broad class of written instructions, the resulting model, InstructGPT, was found to be very good at its objective.
In human evaluations, outputs from this 1.3-billion-parameter InstructGPT model were preferred to outputs from the 175-billion-parameter pre-trained GPT-3.
Note that this is the case even though InstructGPT has 100 times fewer parameters!
InstructGPT models also showed improvements in truthfulness and produced toxic output less often than the pre-trained GPT-3.
In this story, we take a deep dive into RLHF and learn how to implement it from scratch.
Excited? Let’s begin.
But First, How Is GPT-3 Pre-Trained?
GPT-3, or Generative Pre-trained Transformer, is an LLM based on the Transformer architecture.
However, unlike the original Transformer, GPT-3 uses only the Decoder of the Transformer architecture.
This Decoder consists of:
Masked Multi-Head Self-Attention Layers that allow each token to attend to the tokens that precede it in the input sequence.
Feedforward Layers that bring non-linear transformations to the token embeddings.
Layer Normalization that stabilizes training and improves convergence.
Positional Encodings that add sequence order information to tokens.
GPT-3 has a similar architecture to GPT-2 and comes in different variants, ranging from 125 million to 175 billion parameters (the largest variant is the one popularly called “GPT-3”).

GPT-3 was trained on a wide range of datasets that enable it to understand the rules of language.
These datasets include:
Common Crawl: Dataset that represents a large snapshot of the web, filtered to remove irrelevant content
WebText2: Dataset extracted from Reddit links with high karma points
Books1 & Books2: Dataset consisting of collections of books
Wikipedia: Dataset consisting of all English-language Wikipedia articles

These datasets are combined and pre-processed, and their text is tokenized into subwords using Byte-pair encoding (BPE).
These tokens are converted to Embeddings and are fed to the GPT-3 model, which is trained with Unsupervised learning to predict the next token in the sequence based on the previous tokens.
The loss function minimized during its training is the Negative Log-Likelihood Loss (NLL).
For a sequence of tokens X = { x(1), x(2), … x(N) }, the loss is the sum of the negative logarithms of the probabilities the model assigns to each correct next token —

L(θ) = − Σ (t = 1 … N) log P( x(t) | x(1), …, x(t−1); θ )
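As a minimal sketch of this loss (PyTorch-style, not the actual GPT-3 training code; tensor names are hypothetical), the per-token NLL can be computed by shifting the logits and targets by one position and applying cross-entropy:

import torch
import torch.nn.functional as F

# logits: (batch, seq_len, vocab_size) output by a decoder-only LM
# token_ids: (batch, seq_len) input token IDs
def next_token_nll(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    # Positions 0..t predict token t+1, so drop the last logit and the first target.
    shifted_logits = logits[:, :-1, :]
    shifted_targets = token_ids[:, 1:]
    # Cross-entropy is the negative log-likelihood of the correct next token.
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        shifted_targets.reshape(-1),
    )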

Given the input token embeddings, the GPT-3 model outputs logits for these.
The softmax function converts these logits into a probability distribution over the vocabulary.
From this distribution, tokens are selected either deterministically (choosing the highest probability token) or probabilistically based on the ‘Temperature’ hyperparameter (for a more creative response).
The selected token IDs are mapped back to their corresponding words or subwords using the tokenizer.
This results in a human-readable output from the GPT model.
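To make the decoding step concrete, here is an illustrative sketch (PyTorch; not OpenAI's actual decoding code) of temperature-based sampling from the final-position logits:

import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    # logits: (vocab_size,) scores for the next token at the last position.
    if temperature == 0.0:
        # Deterministic (greedy) decoding: pick the highest-probability token.
        return int(torch.argmax(logits))
    # Higher temperature flattens the distribution (more creative output),
    # lower temperature sharpens it (more predictable output).
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

The returned token ID is then mapped back to its word or subword by the tokenizer, as described above.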
Why Is The Pre-trained GPT-3 So Good?
Unlike previous GPT models, the pre-trained GPT-3 model can adapt to different tasks based on user-defined prompts in natural language without requiring any further retraining or fine-tuning.
It can be used for performing different downstream tasks through:
Few-Shot Learning: By giving it a few task-specific examples in the prompt.
# Input:
Classify these as Spam or Not Spam:
1. "You have won a free iPhone! Click here to claim your prize." -> Spam
2. "Your bank account statement is now available online." -> Not Spam
3. "Congratulations, you've been selected for a special offer!" -> Spam
4. "Reminder: Your meeting is scheduled for tomorrow at 3 PM." ->
# Output:
"Not Spam"
One-Shot Learning: By giving it a single task-specific example in the prompt.
# Input:
Translate the following English sentence to German:
Example: "The weather is sunny." -> "Das Wetter ist sonnig."
Next, translate this sentence:
"Where is the nearest train station?" ->
# Output:
"Wo ist der nächste Bahnhof?"
Zero-Shot Learning: By giving it no task-specific examples in the prompt.
# Input:
"What is the capital of India?"
# Output:
"The capital of India is New Delhi"
Next To Supervised Fine-Tuning (SFT)
A team of trained human labellers write up prompts. These, along with prompts sourced from the OpenAI API playground, are combined to produce a prompt dataset.
The labellers then create the desired responses to the prompts in this dataset.
These prompt-response pairs are used to fine-tune the pre-trained GPT-3 with supervised learning.
This process again uses Negative Log-Likelihood Loss (NLL), which measures the difference between the model’s predicted token probabilities and the ground-truth tokens created by labellers.
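A minimal sketch of the SFT loss (PyTorch-style; model, token_ids, and prompt_len are hypothetical names): the prompt tokens are commonly masked out so that the loss is computed only on the response the labellers wrote.

import torch
import torch.nn.functional as F

# token_ids: (batch, seq_len) prompt tokens followed by response tokens
# prompt_len: number of prompt tokens at the start of each row
def sft_loss(model, token_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    logits = model(token_ids)                 # (batch, seq_len, vocab_size)
    targets = token_ids[:, 1:].clone()        # next-token targets
    targets[:, : prompt_len - 1] = -100       # ignore the loss on prompt positions
    return F.cross_entropy(
        logits[:, :-1, :].reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )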

Training A Reward Model
Starting from the supervised fine-tuned model, its final unembedding layer is removed, and a linear layer that outputs a scalar value representing the reward for each response is added.
This results in a Reward model (RM).
This model takes a prompt and a response as input and outputs a scalar reward indicating how well the response aligns with labellers’ preferences (a proxy for human preferences).
The dataset for training the Reward model is derived from human labellers’ comparisons of multiple model outputs for a given prompt.
A 6 billion parameter GPT-3 is supervised fine-tuned and converted into a Reward model, as the larger 175 billion parameter model was found to be unstable when used for this purpose.
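A rough sketch of this architectural change (assuming a generic transformer_backbone that returns per-token hidden states; the names are hypothetical, not OpenAI's code):

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, transformer_backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = transformer_backbone            # SFT model minus the unembedding layer
        self.reward_head = nn.Linear(hidden_size, 1)    # new layer producing a scalar reward

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(token_ids)               # (batch, seq_len, hidden_size)
        # Use the hidden state of the final token as a summary of prompt + response.
        return self.reward_head(hidden[:, -1, :]).squeeze(-1)   # (batch,)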
The loss function used to train the Reward model is a pairwise ranking loss based on Cross-entropy.
In equation form, it is —

loss(θ) = − 1/(K choose 2) · E_{(x, y_w, y_l) ∼ D} [ log( σ( r_θ(x, y_w) − r_θ(x, y_l) ) ) ]

This might look scary, but it is easy to understand.
In the above equation:
x is the given prompt
y_w is the preferred completion of the given prompt (as determined by human labellers)
y_l is the less preferred completion for the same prompt
r_θ(x, y) is the scalar reward output by the Reward model for the prompt x and completion y, with parameters θ
D is the dataset of human-labelled comparisons
K is the number of responses/completions collected for a prompt
(K choose 2), or “K choose 2”, is the number of unique pairwise comparisons that can be made between K responses
The sigmoid function (σ) in the above equation converts the reward difference into a probability.
The term E represents the expected value (average) over the dataset D of human-labelled comparisons.
The term 1/(K choose 2) normalizes the loss over all possible pairwise response comparisons for a prompt in the dataset.
Without this, prompts with more responses (larger K) would dominate the loss function, resulting in imbalanced training.
The negative sign of the loss indicates that Reward model training is a minimization problem.
Overall, the loss function penalizes the Reward model if it assigns a higher reward to the less preferred response.
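A minimal sketch of this loss for a batch of comparisons (PyTorch; reward_model is assumed to behave like the RewardModel sketch above):

import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, preferred_ids, rejected_ids):
    # preferred_ids / rejected_ids: token IDs of the prompt followed by the
    # preferred and the less preferred response, respectively.
    r_preferred = reward_model(preferred_ids)   # (batch,) scalar rewards
    r_rejected = reward_model(rejected_ids)     # (batch,)
    # -log(sigmoid(r_w - r_l)) is small only when the preferred response scores higher.
    return -F.logsigmoid(r_preferred - r_rejected).mean()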
Finally, To Reinforcement Learning
Reinforcement learning (RL), a subset of machine learning, is next used to preference-tune the supervised fine-tuned GPT.
A typical Reinforcement learning setting consists of:
Agent: The decision maker/ action taker
Environment: The surroundings that give feedback based on the agent’s actions
State: The representation of the current situation of the Environment
Actions: The set of all possible moves an agent can make in a given state.
Reward: A scalar feedback received by the agent after taking an Action in a particular State
Episode: The complete sequence of interactions between the agent and the environment. It starts from an initial state and ends when a terminal state is reached.
Policy: The agent's strategy to choose actions in a given state. Technically, it is the mapping from a State to a probability distribution over Actions.
Value Functions: Functions that estimate the expected Return or total future rewards from a given State (State-Value function) or State-Action pair (Action-Value function or Q-function) if the agent acts according to a particular policy thereafter.
Advantage Function: A function that quantifies how much better it is to take a specific action at a particular state compared to the average action at that state.
Mathematically, it is the difference between the Q-function (Action-Value function) and the State-Value function.
With RL, an agent’s objective is to find a policy that maximizes the expected cumulative reward (or Return) over time.
Future rewards are commonly discounted using a Discount Factor, a value between 0 and 1 that reduces the weight of rewards the further they lie in the future.
This is based on the idea that immediate rewards are more valuable than distant ones.
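As a tiny illustrative sketch (plain Python; the reward list is made up), the discounted Return over a sequence of rewards looks like this:

def discounted_return(rewards, discount_factor=0.99):
    # G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
    total, weight = 0.0, 1.0
    for r in rewards:
        total += weight * r
        weight *= discount_factor
    return total

print(discounted_return([1.0, 1.0, 1.0]))  # 1 + 0.99 + 0.9801 = 2.9701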
We will revisit more RL fundamentals soon. Let’s return to the original topic for now.

The process of preference tuning GPT starts with setting up a single-arm bandit environment.
This means an agent can interact with a single action or ‘arm’ in this environment.
The agent in this setting is the supervised fine-tuned GPT model.
The action space of the agent consists of all possible token sequences it can generate.
Since the model can generate outputs (Actions) based on the given prompt (State), its internal state (or simply the model itself) is considered the baseline Policy, termed π^SFT.
The goal of the agent is to optimize its policy to produce responses that are aligned with human values. The learned policy after RL training is termed π_ϕ^RL, where ϕ are its parameters.
Time for training!
An episode starts with the environment presenting a single random prompt x (State) to the agent and expecting a response y (Action) from it.
The agent’s response is evaluated by the previously trained Reward model (part of the environment), which returns a reward r_θ(x, y) as feedback.
This one-step process marks the end of an Episode.
During preference training via RL, the agent can find and exploit patterns in the Reward Model (due to biases in its training) that make it return high rewards for the agent responses.
Such responses could still be undesirable when humans evaluate them.
To ensure this doesn’t happen, a penalty is applied to each token the agent produces.
This penalty is based on the Kullback-Leibler (KL) divergence between the responses generated by the policy being trained (π_ϕ^RL) and the baseline policy (π^SFT).
This might all seem very complicated, but it’s really simple to understand.
Kullback–Leibler (KL) divergence simply measures how two probability distributions differ.
The KL divergence in our case is shown in the following equation —

KL( π_ϕ^RL || π^SFT ) = E_{y ∼ π_ϕ^RL(y | x)} [ log( π_ϕ^RL(y | x) / π^SFT(y | x) ) ]

It compares the probability distributions over responses produced by the RL policy being trained, π_ϕ^RL(y | x), and by the baseline SFT GPT model/policy, π^SFT(y | x).
This value is calculated for both models/policies at each token generation step rather than being calculated over the entire output sequence at once.
For a single output token y_t at time step t, the per-token KL divergence is proportional to the term below —

log( π_ϕ^RL(y_t | x, y_<t) / π^SFT(y_t | x, y_<t) )
This term acts as a penalty to the reward generated by the Reward model and makes sure that the trained policy/ model doesn’t deviate too far from the baseline SFT policy/ model.
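A rough sketch of how this per-token penalty can be folded into the reward signal (PyTorch; the log-probability tensors and the kl_coef coefficient are hypothetical names, not the exact InstructGPT setup):

import torch

def kl_shaped_rewards(rl_logprobs, sft_logprobs, reward_score, kl_coef=0.2):
    # rl_logprobs / sft_logprobs: (seq_len,) log-probabilities of the generated
    # response tokens under the trained RL policy and the frozen SFT policy.
    per_token_kl = (rl_logprobs - sft_logprobs).detach()   # log(pi_RL / pi_SFT) per token
    rewards = -kl_coef * per_token_kl                      # KL penalty at every token
    rewards[-1] = rewards[-1] + reward_score               # RM score added at the final token
    return rewards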

Coming back, the RL preference tuning objective that we aim to maximize can be expressed as —

objective(ϕ) = E_{(x, y) ∼ D_{π_ϕ^RL}} [ r_θ(x, y) − β · log( π_ϕ^RL(y | x) / π^SFT(y | x) ) ]

Here, β is a coefficient that controls the strength of the KL penalty.
It has been observed that preference tuning alone can degrade the model's performance on general natural language processing (NLP) tasks that the pre-trained model was good at.
Thus, a pre-training loss term is added to the above objective.
This acts as a form of regularization and balances the model’s alignment with the reward model while still preserving the knowledge from the original pre-trained language model.
This term is the log-likelihood of sequences from the pre-training data (D_pretrain) under the trained RL model/policy π_ϕ^RL.
Adding this term (with a hyperparameter γ that controls its strength) to the previous objective results in the following equation —

objective(ϕ) = E_{(x, y) ∼ D_{π_ϕ^RL}} [ r_θ(x, y) − β · log( π_ϕ^RL(y | x) / π^SFT(y | x) ) ] + γ · E_{x ∼ D_pretrain} [ log π_ϕ^RL(x) ]
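In code, adding this term is conceptually just a weighted sum of two losses (a sketch with hypothetical names; the individual losses are covered elsewhere in this story):

def combined_loss(preference_tuning_loss, pretrain_nll, gamma):
    # Maximizing the combined objective is equivalent to minimizing the
    # preference-tuning loss plus gamma times the NLL on pre-training data.
    return preference_tuning_loss + gamma * pretrain_nll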
Next, we use an RL algorithm called Proximal Policy Optimization (PPO) to maximize this objective.

A Detour Into Proximal Policy Optimization (PPO)
Time for some more RL fundamentals.

Earlier, we learned that an Agent's goal is to optimize its policy and gain the maximum rewards from its environment.
For this, an agent might or might not have access to (or learn) the model of its environment (i.e. the function which predicts state transitions and rewards).
If it does, the agent can think ahead and decide between actions to take at a given state.
Such an approach is called Model-based reinforcement learning.
A popular example is AlphaZero, an RL model that achieves superhuman performance in Chess, Shogi, and Go.
Commonly, the exact model of the environment is not available to the agent and learning it from the agent’s experience is extremely difficult.
This leads to another approach, called Model-free reinforcement learning, which is easier to implement.
Models in this category can either:
Learn Value functions (commonly the Q-function or the Action-Value function) and are called Value-Based Methods
Learn the policy directly without relying on Value functions, and are called Policy-Based Methods
The ones we’re interested in here are the Policy-based methods.
These methods can work with continuous action spaces and with stochastic policies.
The key idea behind these methods is to optimize an agent's policy by finding policy parameters that result in the maximum expected cumulative reward.
The foundation for policy gradient methods is the Policy Gradient Theorem.

∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(a | s) · Q^π(s, a) ]

Here, Q^π(s, a) is the Q-function that represents the expected cumulative reward when an action a is taken at state s and policy π is followed thereafter.
This equation describes how to change the policy parameters to increase the expected return.
Mathematically, it gives the gradients of the expected return J(θ) with respect to the policy parameters θ.
By computing them, we can perform Gradient Ascent or update the parameters in the direction that increases the expected return.
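A toy sketch of a vanilla policy-gradient (REINFORCE-style) loss, where the observed return of each sampled action stands in for Q^π(s, a); all names are hypothetical:

import torch

def policy_gradient_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # log_probs: log pi_theta(a_t | s_t) for the actions actually taken, shape (T,)
    # returns: observed return following each action, an estimate of Q^pi(s, a)
    # Minimizing this loss performs gradient ascent on the expected return.
    return -(log_probs * returns.detach()).mean()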
Many algorithms were developed on top of the Policy Gradient Theorem, but they were difficult to work with.
The reasons for this were:
Unstable learning due to the high variance of gradients
Requirement for a large number of samples from the environment
Sensitivity and difficulty in finding the correct Learning rate (LR) during training
An algorithm published in 2015 called Trust Region Policy Optimization (TRPO) addressed these challenges.
The idea behind TRPO is to limit the policy change at each update to ensure stable and guaranteed monotonic improvement.
TRPO’s objective uses the Kullback-Leibler (KL) divergence to achieve this by ensuring that the updated policy doesn’t deviate much from the previous one.
Unfortunately, TRPO is tough to implement and uses second-order derivatives, which makes it slow and computationally expensive.
To fix this, another algorithm called Proximal Policy Optimization (PPO) was published by OpenAI in 2017.
PPO is far simpler to implement, can achieve effective learning with fewer samples, and has similar stability to TRPO.

How Does PPO Optimize The Preference Tuning Objective?
Remember the preference tuning objective that we previously defined?
PPO is used to maximize this objective in a stable and efficient way by modifying the parameters ϕ of the RL policy.
To ensure that there are no drastic policy updates, a ratio between the old and updated policy is first calculated.
PPO then uses a Clipped surrogate objective (shown below) that:
Clips this policy ratio to stay within the bounds of 1 − ϵ to 1 + ϵ (where ϵ is a small constant)
Multiplies both the clipped and unclipped ratios by the Advantage function
Takes the minimum of these two terms to update the policy

L_CLIP(ϕ) = E_t [ min( r_t(ϕ) · A_t, clip( r_t(ϕ), 1 − ϵ, 1 + ϵ ) · A_t ) ]

where:
A_t is the Advantage function, which tells how much better an action is (as per its received reward) than the expected value of the current state.
r_t(ϕ) is the ratio of the updated and the old policy probabilities.
ϵ is a small constant that limits how much r_t(ϕ) can change, ensuring that the updates are not too large.
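A minimal PyTorch-style sketch of this clipped objective, written as a loss to minimize (tensor names are hypothetical):

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    # Ratio between the updated policy and the policy that generated the samples.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the minimum of the two surrogates; negate because optimizers minimize.
    return -torch.min(unclipped, clipped).mean()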
What Is Reinforcement Learning From Human Feedback (RLHF) Finally?
The complete process described above can be summarised in three steps:
Supervised fine-tuning (SFT) of a pre-trained GPT model
Reward model training
Preference tuning of supervised fine-tuned GPT using the Reward model with Reinforcement learning via Proximal Policy Optimization (PPO)
This is called Reinforcement Learning From Human Feedback (RLHF).

How Well Does This Technique Work?
In the original research paper, the GPT tuned by the RLHF process described above is termed InstructGPT.
Results show that InstructGPT significantly outperforms the other baselines, with the outputs from the 1.3 billion parameter preference-tuned GPT preferred to those from the 175 billion parameter supervised fine-tuned GPT-3.
Note that in the plots below:
‘PPO-ptx’ is the GPT preference-tuned with the objective that includes the pre-training loss term
‘PPO’ is the GPT preference-tuned with the objective that does not include the pre-training loss term
‘SFT’ is the supervised fine-tuned GPT model
‘GPT’ is the baseline pre-trained model
‘GPT (prompted)’ is the pre-trained model prompted in a few-shot fashion to follow instructions

The preference-tuned GPT is more likely to attempt the correct instruction, respond appropriately in the context of the customer request, and follow the constraints mentioned in the request, and it is less likely to hallucinate.

When given question-answering tasks (with and without additional prompts to avoid giving false answers), preference-tuned GPT is slightly more truthful and informative than the baselines.

Finally, regarding toxic output generation, preference-tuned GPT is less toxic than pre-trained GPT when prompted to be respectful but performs similarly when not prompted to do so.

That’s everything for RLHF. I hope it helped you understand how this remarkable technique works.
Happy learning!
Further Reading
ArXiv research paper titled ‘Training language models to follow instructions with human feedback’
ArXiv research paper titled ‘Learning to summarize from human feedback’
ArXiv research paper titled ‘Proximal Policy Optimization Algorithms’
ArXiv research paper titled ‘Deep reinforcement learning from human preferences’