From SFT to RLHF¶
Pretraining gives a model a vast latent knowledge base; what it does not give the model is a disposition to deploy that knowledge in service of a user. A 175B-parameter base model is, by construction, a very good imitator of the empirical distribution of internet text. It is not a helpful assistant. The gap between "predict the next token on a webpage" and "follow this user's instruction helpfully and safely" is precisely the gap that the post-training pipeline is built to close [src_033, src_005].
This chapter walks through the canonical pre-2023 recipe for closing that gap: supervised fine-tuning on demonstrations, preference modelling via Bradley-Terry, and reinforcement learning with PPO against the learned reward model. The pipeline is named after its first widely publicised production deployment, InstructGPT [src_033], and for roughly two years it was the recipe everyone copied. Chapters 12 and 13 will then dismantle it from two different angles, but the algorithmic content of those chapters only makes sense with this baseline in hand. We arrive in Part 6 with a pretrained base model that is, by construction, a token-prediction engine; this chapter and the next two are about converting that engine into a usable assistant.
1. The post-training pipeline¶
The whole alignment story can be drawn as a single horizontal flow:

[Figure: base model → SFT → preference data → one of three preference-learning branches: Path A (reward model + PPO), Path B (DPO), Path C (verifiable rewards / GRPO).]
The figure groups the three coexisting branches that the field now offers for the final preference-learning stage. Path A — the InstructGPT path that this chapter covers — fits a separate reward model on human preference comparisons and then optimises the policy against that reward with PPO, regularised by a KL penalty toward the SFT reference [src_033, src_005]. Path B is direct preference optimisation (DPO), which collapses the reward model and the RL loop into a single classification objective on the policy itself; Chapter 12 derives it. Path C is the post-2024 verifiable-reward family (GRPO and friends), which throws out human preferences entirely in favour of programmatic checkers; Chapter 13 develops it.
Treat the rest of this chapter as historical scaffolding. The Xiao and Zhu monograph organises its alignment chapter around essentially the same three stages [src_002], and Lambert's RLHF Book uses the same partition with more detail on the failure modes [src_005]. The vocabulary is shared across all three branches, even though the algorithms diverge: every modern alignment paper still talks about a reference policy, a KL constraint, and pairwise preferences. Those concepts are introduced here.
2. Why post-training at all¶
Pretraining optimises a single objective: maximise the likelihood of the next token under a corpus that is, in practice, scraped web text plus a curated long tail. This objective produces a model that is extraordinarily good at one specific game, namely text continuation, and it produces nothing else for free. In particular, it produces no incentive to refuse harmful requests, no incentive to admit uncertainty, no incentive to produce a short answer when a short answer is what is asked for, and no notion at all of "the response a helpful person would write" — the corpus contains plenty of helpful prose, plenty of unhelpful prose, plenty of toxic prose, and plenty of correct-but-irrelevant prose, and the model learns to imitate all of it in proportion to its frequency. The Ouyang et al. paper opens with this exact observation: the language modelling objective is misaligned with the deployment objective [src_033].
Post-training is the umbrella term for all the procedures applied after pretraining is finished, and its purpose is to overwrite this default disposition with a more useful one. The model already contains the relevant knowledge — it has read enough Python tutorials to answer most Python questions, enough medical literature to discuss medical literature, enough toxicology to know what is dangerous — but the disposition to surface that knowledge in the format and tone the user actually wants is not present in the base model and has to be installed [src_005]. SFT installs the format. RLHF installs the preference signal that SFT cannot encode.
🔗 Connection
The "language modelling objective" referred to here is the CLM/MLM family of pretraining objectives developed in Chapter 7 (encoder-decoder). What follows is the first place in the book where that objective and the deployment objective visibly diverge.
3. Supervised fine-tuning¶
The first stage, supervised fine-tuning (SFT), is also called instruction tuning. It is conceptually the simplest of the three: collect a dataset \(\mathcal{D}_{\text{SFT}} = \{(x_i, y_i)\}\) of high-quality (prompt, response) pairs, where each \(y_i\) is a demonstration of the desired behaviour on prompt \(x_i\), and fine-tune the pretrained model on it with the standard autoregressive cross-entropy loss restricted to response tokens. Concretely, for a single example the loss is

\[
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t}),
\]

where \(\pi_\theta\) is the policy (the model being trained) and the sum runs only over the response tokens; the prompt \(x\) is conditioned on but not scored. Masking the prompt tokens out of the loss matters because the prompt distribution is not what we are trying to learn — we are trying to learn the conditional response distribution given a prompt [src_005, src_010].
⚠️ Pitfall
If we did not mask the prompt tokens out of the loss, the model would also be trained to imitate the user's prompt distribution — which is not the modelling goal and would actively degrade the conditional response distribution we are trying to learn. The prompt-mask is not optional; dropping it changes the objective.
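In code, the mask is a single argument. A minimal sketch of the per-example loss, assuming the common convention of labelling prompt positions with \(-100\) (PyTorch's default `ignore_index`; the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over response tokens only.

    logits: (batch, seq_len, vocab) model outputs
    labels: (batch, seq_len) token ids, with prompt positions set to -100
            so that cross_entropy skips them (PyTorch's default ignore_index)
    """
    # Shift so that position t predicts token t+1 (standard causal LM setup).
    logits = logits[:, :-1, :].contiguous()
    labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # prompt tokens contribute nothing to the loss
    )
```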
For InstructGPT, the demonstrations were written by a team of around 40 contractors hired through Upwork and ScaleAI, and the SFT dataset contained on the order of thirteen thousand training prompts [src_033]. Demonstrations span the full distribution of intended uses: brainstorming, summarisation, classification, extraction, open-ended generation, and so on. Ouyang et al. trained for sixteen epochs with cosine learning-rate decay; they noted that validation loss overfits after the first epoch, but downstream preference scores keep improving, so they selected the SFT checkpoint by reward-model score on a held-out set rather than by SFT validation loss [src_033]. This is a small but instructive detail: from the SFT stage onwards, validation loss on the SFT objective is no longer the metric the practitioner cares about.
What SFT achieves and what it does not achieve are both worth stating plainly. SFT achieves format and disposition imitation: after SFT the model produces responses that look like the demonstrations, follow instructions of the type the demonstrations show, and refuse the type of requests the demonstrations refuse. SFT does not achieve calibration to relative quality. There is, by construction, only one demonstration per prompt in the SFT dataset, so the SFT loss has no way to express "response A is better than response B" beyond what is implicit in the choice of \(y_i\) as the single demonstration [src_033, src_005, src_010]. To install relative-quality calibration you need a different signal, and that signal is preferences.
💡 Key result
SFT validation loss overfits after one epoch but downstream preference scores keep improving — the proxy metric and the deployment metric have already diverged at the end of stage one.
4. Preference data¶
The preference dataset has a different shape from the SFT dataset. Each example is a triple \((x, y_w, y_l)\), where \(x\) is a prompt, \(y_w\) is the response a human labeler preferred, and \(y_l\) is the response the same labeler dispreferred. The two responses are typically sampled from one or more candidate models on the same prompt, and the labeler is asked to pick a winner [src_033, src_005].
InstructGPT collected this data at scale: the reward-model dataset contained roughly thirty-three thousand training prompts, and each prompt was shown to a labeler alongside between four and nine candidate responses to rank, which, expanded into all \(\binom{K}{2}\) pairs, produces a much larger pool of pairwise comparisons [src_033]. The Ouyang et al. paper reports inter-annotator agreement of about 73% between training labelers and 77% between held-out labelers, which is roughly the agreement rate Stiennon et al. had reported for summarisation preference labelling [src_033]. That ceiling on agreement is itself a feature of the data, not a bug: human preferences on open-ended generation are genuinely noisy, and any reward model trained on them can never be more accurate than the labelers themselves are consistent.
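The \(\binom{K}{2}\) expansion itself is mechanical. A sketch, assuming each ranking arrives as a list ordered best-to-worst (the function name is illustrative):

```python
from itertools import combinations

def ranking_to_pairs(prompt: str,
                     ranked: list[str]) -> list[tuple[str, str, str]]:
    """Expand a labeler's ranking of K responses into all C(K, 2) triples.

    ranked is assumed ordered best-to-worst, so for every index pair
    i < j the response at i is the winner y_w and the one at j is y_l.
    """
    return [(prompt, ranked[i], ranked[j])
            for i, j in combinations(range(len(ranked)), 2)]
```

A ranking of \(K = 4\) responses yields 6 pairs; \(K = 9\) yields 36. Ouyang et al. note that the comparisons produced by a single ranking are highly correlated, and they report training on all of them within a single batch element rather than shuffling them independently [src_033].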
Two practical points are worth flagging. First, preference data is much cheaper per unit of useful signal than demonstration data, because ranking is faster than writing — a labeler can compare four candidate responses in much less time than it takes to write one good demonstration. Second, the preference dataset is what enables the model to express trade-offs. If two responses are both well-formed but one is more concise, more accurate, or less hedging, the preference signal records that. SFT alone has no way to express such a comparison [src_005].
5. The Bradley-Terry reward model¶
Given a preference dataset, the reward-modelling step turns it into a scalar function \(r_\phi(x, y)\) that scores any (prompt, response) pair. The standard model — used in InstructGPT, in the DPO derivation, and in essentially every RLHF system since — is the Bradley-Terry preference model, which assumes that the probability a human prefers \(y_w\) to \(y_l\) on prompt \(x\) is a sigmoid of the reward gap:
🎯 Intuition
Picture each candidate response as having a hidden quality score that the labeler cannot see directly. The labeler perceives each score with a small dose of logistic noise and picks the higher perceived score. Under this latent-utility model, the probability of preferring \(y_w\) rises smoothly with the gap between the underlying scores: the bigger the gap, the closer \(P(y_w \succ y_l)\) gets to one. The sigmoid in the formula below is exactly this picture written out — it is the noise distribution doing the work, not an arbitrary functional choice.
\[
P(y_w \succ y_l \mid x) = \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr),
\]

where \(\sigma(z) = 1 / (1 + e^{-z})\) is the logistic function [src_033, src_005]. Under this assumption, training a reward model from a dataset of preference pairs is exactly maximum-likelihood estimation of a logistic regression on reward differences.
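The latent-utility picture can be checked numerically: if each response's perceived score is its true score plus independent Gumbel noise, the difference of the two noise draws is logistic, and the empirical preference rate reproduces the sigmoid. A small simulation sketch (the reward gap of 1.0 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
gap = 1.0  # true reward gap r(x, y_w) - r(x, y_l)

# Perceived score = true score + independent Gumbel noise per response;
# the difference of two Gumbels is logistic, which yields Bradley-Terry.
noise_w = rng.gumbel(size=1_000_000)
noise_l = rng.gumbel(size=1_000_000)
pref_rate = np.mean(gap + noise_w > noise_l)

print(f"empirical preference rate: {pref_rate:.3f}")            # ~ 0.731
print(f"sigmoid of the gap:        {1 / (1 + np.exp(-gap)):.3f}")  # 0.731
```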
🤔 Pause and reflect
Before reading the next equation, write down what the maximum-likelihood loss for the Bradley-Terry model looks like. The dataset is a set of preference triples \((x, y_w, y_l)\); the model assigns each triple the probability \(\sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)\). What is the negative log-likelihood, summed over the dataset? (Do not look ahead — write it down or say it out loud.)
The negative log-likelihood loss is

\[
\mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Bigl[\log \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)\Bigr].
\]
That single equation is the entire reward-modelling stage. Architecturally, the reward model is typically initialised from the SFT policy with the LM head replaced by a scalar-output head that consumes the final-layer hidden state at the end-of-response position — the same sequence-classification idiom Chapter 7 used for encoder-only models [src_033]. Ouyang et al. used a 6B reward model rather than 175B because they found 175B reward-model training to be unstable and because the reward model is queried far more often than the policy during PPO, so a smaller reward model substantially reduces compute [src_033].
A small but important detail: the loss is invariant to a global shift of \(r\), since \(r(x, y_w) - r(x, y_l)\) is unchanged if a constant is added to all rewards. InstructGPT therefore normalises the reward model after training so that the labeler demonstrations achieve a mean score of zero, which keeps the per-token KL term in the PPO objective on a comparable scale across runs [src_033].
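Both details — the scalar head reading the end-of-response hidden state, and the post-training mean shift — are compact enough to write out. A sketch under stated assumptions (the class name, the gather convention, and folding the shift into the bias are illustrative choices, not taken from the sources):

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Scalar-output head over the backbone's final hidden states."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor,
                eos_positions: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); eos_positions: (batch,)
        batch_idx = torch.arange(hidden_states.size(0),
                                 device=hidden_states.device)
        final_hidden = hidden_states[batch_idx, eos_positions]  # (batch, hidden)
        return self.score(final_hidden).squeeze(-1)             # (batch,)

# After training, exploit the shift-invariance of the loss: fold a constant
# into the bias so labeler demonstrations score zero on average, e.g.
# with torch.no_grad():
#     head.score.bias -= demo_rewards.mean()
```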
A minimal PyTorch sketch of the loss makes the structure concrete:
```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_w: torch.Tensor,
                       reward_l: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry NLL on a batch of preference pairs.

    reward_w: (batch,) reward model output on preferred responses
    reward_l: (batch,) reward model output on dispreferred responses
    """
    # F.logsigmoid is numerically stable in PyTorch; equivalent to
    # log( 1 / (1 + exp(-(rw - rl))) ).
    return -F.logsigmoid(reward_w - reward_l).mean()
```
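A quick check on a toy batch (values chosen arbitrarily): when the preferred response outscores the dispreferred one on average, the loss sits below the chance-level value \(\log 2 \approx 0.693\).

```python
rw = torch.tensor([1.2, 0.3, 2.0])   # rewards on preferred responses
rl = torch.tensor([0.4, 0.5, -1.0])  # rewards on dispreferred responses
print(bradley_terry_loss(rw, rl))    # tensor(0.4059) -- below log 2
```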
Both forward passes — through the reward model on \((x, y_w)\) and on \((x, y_l)\) — share the same parameters, so this is a Siamese-style classification objective (two forward passes through identical-parameter networks compared via their scalar outputs), and it is the only time during the alignment pipeline that the reward model parameters are updated. Once trained, the reward model is frozen for the rest of the procedure [src_033, src_005].
6. PPO and the KL-constrained objective¶
The third stage is the actual reinforcement-learning fine-tune. We have a reward function \(r_\phi\) and a starting policy \(\pi_\theta\) initialised from the SFT model. The naive thing to do would be: sample \(y\) from the policy, score it with \(r_\phi\), and do policy-gradient updates that increase the expected reward. The reason that does not work, in practice, is that the reward model was trained on completions sampled from a particular distribution — roughly the SFT model's distribution — and as soon as the policy starts to drift, it starts producing completions that are out of the reward model's training distribution, where the reward model's predictions become unreliable [src_033, src_034, src_005]. The policy will then happily exploit those unreliable predictions, since by definition the policy gradient pushes toward whatever the reward model scores high, and the result is a model that scores well on \(r_\phi\) but produces obvious garbage to a human reader. This is reward hacking in its most direct form.
⚠️ Pitfall
Reward hacking is named here in §6 — as the failure mode the KL constraint is built to mitigate — and again in §7 as one of three RLHF failure modes (under the synonym reward over-optimisation). The two are the same phenomenon: a classifier exploited by an optimiser. §7 refines the picture into a typology; the definition does not change.
The standard fix is to penalise the policy for moving away from the SFT reference \(\pi_{\text{ref}} = \pi_{\text{SFT}}\). The KL-constrained reward maximisation objective is

\[
J(\pi_\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\bigl[r_\phi(x, y)\bigr] \;-\; \beta\, \mathbb{D}_{\text{KL}}\bigl[\pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\text{ref}}(\cdot \mid x)\bigr],
\]
🎯 Intuition
Read \(J\) as a control loop. The reward term pulls the policy toward labeler-preferred regions of output space; the KL term acts as a leash of length proportional to \(1/\beta\) tethering the policy to the SFT reference. Tuning \(\beta\) is choosing the leash length: a long leash (small \(\beta\)) lets the policy wander far enough to chase reward-model artifacts, a short leash (large \(\beta\)) keeps the policy so close to the reference that it cannot capture the preference signal.
where \(\beta > 0\) is the KL coefficient [src_033, src_034, src_005]. The KL term keeps the policy close to a region where the reward model is calibrated; the reward term pulls the policy in the direction the labelers wanted. The two terms are in direct tension, and tuning \(\beta\) is one of the practitioner's most consequential choices: too small a \(\beta\) and the policy drifts and reward-hacks, too large a \(\beta\) and the policy never moves far enough from the SFT reference to capture the preference signal [src_034, src_005].
6.1 PPO as the optimiser of \(J(\pi_\theta)\)¶
Ouyang et al. optimise \(J(\pi_\theta)\) with proximal policy optimisation (PPO), an actor-critic algorithm in which the value function (the critic) is initialised from the reward model rather than the policy, and the policy update is the standard PPO clipped surrogate [src_033]. In token-level form, the surrogate at step \(t\) is

\[
L_{\text{CLIP}}(\theta) = \mathbb{E}_t\Bigl[\min\bigl(\rho_t(\theta)\,\hat A_t,\;\operatorname{clip}\bigl(\rho_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\bigr)\,\hat A_t\bigr)\Bigr],
\]
🎯 Intuition
Read the clip operator as a trust region. The importance ratio \(\rho_t = \pi_\theta / \pi_{\theta_{\text{old}}}\) measures how far the current policy has moved from the previous iterate at step \(t\); clipping it to the interval \([1-\varepsilon, 1+\varepsilon]\) caps how much one update can amplify or attenuate the action probability away from the iterate that produced the rollout. A single batch therefore cannot push the policy more than a sliver \(\varepsilon\) in either direction along any high-advantage action — that bounded step is the proximal in proximal policy optimisation.
where \(\rho_t(\theta) = \pi_\theta(y_t \mid x, y_{<t}) / \pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})\) is the importance ratio against the previous policy iterate, \(\hat A_t\) is the GAE-style advantage estimate (Generalised Advantage Estimation: a smoothed bias-variance trade-off that blends temporal-difference errors at multiple horizons), and \(\varepsilon\) is a small clipping parameter (commonly \(0.2\)) that prevents any single update from moving the policy too far in the direction of high-advantage actions [src_005]. The KL term is added to the reward at every token rather than only at the end of the sequence, which makes the KL penalty a per-token shaping signal in the advantage computation [src_033]. In InstructGPT specifically, an additional pretraining-mix term ("PPO-ptx") interleaves a small fraction of pretraining-style log-likelihood updates to mitigate "alignment tax" — performance regressions on standard NLP benchmarks that show up after RLHF [src_033].
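The per-token placement of the KL term is easy to get wrong in an implementation, so it is worth writing out. A sketch of the shaped reward as it is commonly assembled (the function name and tensor layout are illustrative assumptions):

```python
import torch

def shaped_rewards(logprobs_policy: torch.Tensor,
                   logprobs_ref: torch.Tensor,
                   rm_score: torch.Tensor,
                   beta: float) -> torch.Tensor:
    """Per-token rewards for PPO: KL shaping at every step, RM score at the end.

    logprobs_policy, logprobs_ref: (batch, resp_len) log-probs of the sampled
        tokens under the current policy and the frozen SFT reference
        (treated as detached rollout statistics, not live graph nodes)
    rm_score: (batch,) scalar reward-model score for the full response
    """
    # Single-sample per-token estimate of the KL penalty toward the reference.
    rewards = -beta * (logprobs_policy - logprobs_ref)
    rewards[:, -1] += rm_score  # sequence-level reward lands on the last token
    return rewards
```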
🤔 Pause and reflect
Consider what the PPO-CLIP gradient does at the boundary of the trust region. Suppose for a particular token \(t\) we have \(\hat A_t > 0\) and \(\rho_t(\theta) = 1 + \varepsilon\) exactly. What gradient does \(L_{\text{CLIP}}\) supply at this point — and what does that imply about how far this single update can push the policy on this token? (Trace the \(\min\) and \(\mathrm{clip}\) structure for this case before reading on.)
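Tracing the reflection in code makes the answer concrete. A sketch of the clipped surrogate, sign-flipped into a loss to minimise (names are illustrative):

```python
import torch

def ppo_clip_loss(logprobs_new: torch.Tensor,
                  logprobs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Negative PPO-CLIP surrogate on a batch of token-level samples.

    logprobs_new: log pi_theta(y_t | x, y_<t) under the current parameters
    logprobs_old: same quantity under the iterate that produced the rollout
    advantages:   GAE-style advantage estimates, one per token
    """
    ratio = torch.exp(logprobs_new - logprobs_old)            # rho_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # min() takes the more pessimistic surrogate; once rho_t leaves the
    # [1 - eps, 1 + eps] band in the advantage's direction, the clipped
    # branch is active and contributes zero gradient to theta.
    return -torch.min(unclipped, clipped).mean()
```

For the boundary case in the reflection — \(\hat A_t > 0\) and \(\rho_t\) at or just past \(1 + \varepsilon\) — the `min` selects the clipped branch, and because `torch.clamp` is flat outside the band, that branch supplies zero gradient: the update cannot push this token's probability any further, which is exactly the bounded step the intuition box described.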
6.2 Engineering cost and tuning surface¶
A reader who has not implemented PPO before should take away two things from the equations above. First, the algorithm is heavy: a full RLHF run keeps four large neural networks alive at once — the policy, a frozen copy of the SFT reference for the KL penalty, the frozen reward model, and the trainable critic — and that is before counting optimiser state. The 175B InstructGPT runs were correspondingly expensive, which is part of why Ouyang et al. used a 6B reward model rather than a 175B one [src_033, src_005]. Second, PPO has many small tuning knobs — clipping range, advantage normalisation, entropy bonus, KL coefficient, KL controller (fixed or adaptive), batch size, mini-batch size, number of inner epochs per rollout — and these knobs interact non-trivially. Practitioners describe PPO RLHF as "fragile" not because any single piece is mysterious but because the pipeline tolerates only a small cumulative budget of small mistakes [src_005, src_034].
7. Failure modes of RLHF¶
PPO RLHF in language-model practice has three failure modes that recur across reports and that any practitioner reading the chapter should recognise.
⚠️ Pitfall
The three failure modes below are different mechanisms with the same operational tell — a policy that scores high on \(r_\phi\) but degrades in human evaluation. KL drift is a controller pathology (the adaptive controller mismanages \(\beta\)); reward hacking is a classifier pathology (the reward model has artifacts the policy finds); mode collapse is a distribution pathology (the policy concentrates onto a narrow output region). Distinguishing the three matters because the diagnostic and the fix differ for each.
The first is KL drift. The KL penalty is supposed to keep the policy close to \(\pi_{\text{ref}}\), but in practice the adaptive KL controllers used in production (which adjust \(\beta\) to hit a target KL per update) can misbehave: if the controller pushes \(\beta\) down too aggressively in response to a low-KL batch, the next batch can overshoot and drift far from the reference, into a region where the reward model is no longer calibrated [src_005]. Once the policy is in a miscalibrated region, every subsequent gradient step is, in expectation, optimising against noise.
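To see how the overshoot arises, it helps to look at a controller. A sketch of one widely used proportional scheme — this exact form is an assumption here, modelled on the Ziegler-style controllers found in open-source RLHF code, not taken from this chapter's sources:

```python
class AdaptiveKLController:
    """Proportional controller that nudges beta toward a target per-update KL.

    Multiplicative updates with a clipped proportional error, so a single
    noisy KL estimate cannot swing beta too violently -- though, as the
    text notes, even this can overshoot across consecutive batches.
    """

    def __init__(self, beta: float, target_kl: float,
                 horizon: float = 10_000.0):
        self.beta = beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        error = min(max(observed_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta
```

If a run of low-KL batches drives the error to its negative clip, \(\beta\) shrinks multiplicatively on every update; the leash is then at its longest exactly when the policy finds a high-reward direction to chase.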
The second is reward hacking, also called reward over-optimisation. The reward model is a classifier trained on a finite preference dataset; like any classifier it has artifacts — surface features that correlate with high reward in the training distribution but do not actually reflect what humans prefer in general. The policy, given long enough, will find these artifacts and exploit them. Common exploited artifacts in production RLHF include excessive length (responses that score high simply by being verbose), excessive use of bullet lists, and sycophantic prefixes such as "That is a great question". The DPO paper in particular discusses PPO instability under reward over-optimisation in its experimental analysis [src_034]. The KL penalty mitigates this — by keeping the policy near the reference, it limits how far the policy can chase artifacts — but it does not eliminate it, and in fact the KL penalty itself trades off against the reward, so any chosen \(\beta\) leaves residual exploitation on the table.
The third is mode collapse on long-form generation. Under RL pressure the policy can collapse to a narrow output distribution, particularly for long-form prompts where the reward signal is shaped by a small number of structural cues. Symptoms include repetitive boilerplate openings, formulaic structure across diverse prompts, and reduced output diversity at fixed temperature [src_005]. Mode collapse is not unique to PPO RLHF — it shows up in DPO too — but PPO's combination of online sampling and high-variance reward signals tends to surface it more aggressively than offline alternatives [src_034].
The honest summary is this: PPO RLHF works, in the sense that InstructGPT, GPT-3.5, GPT-4, Claude, and Llama-2-Chat all relied on variants of it, but it is a finicky pipeline that takes substantial engineering investment to make stable. The two chapters that follow are, in different ways, both reactions to that finickiness.
🔄 Recap
- Explain in your own words why an adaptive KL controller pushing \(\beta\) toward zero in response to a low-KL batch can produce KL drift on the next batch.
- Compare reward hacking and mode collapse: which is fundamentally a classifier pathology, and which is fundamentally a distribution pathology?
- Predict whether mode collapse should be more or less aggressive under DPO-style offline preference optimisation than under PPO RLHF, and identify the property of online RL that drives the difference.
8. The InstructGPT result and why it mattered¶
The headline result of the InstructGPT paper is one of the cleanest demonstrations in the alignment literature that algorithm beats scale on the right axis. On the OpenAI API prompt distribution evaluated by labelers, outputs from the 1.3B PPO-aligned InstructGPT model were preferred to outputs from the 175B raw GPT-3 base model, despite the parameter count being roughly two orders of magnitude smaller [src_033]. At the same parameter count, the 175B InstructGPT was preferred to 175B GPT-3 about 85% of the time, and preferred to few-shot-prompted 175B GPT-3 about 71% of the time. Truthfulness on TruthfulQA roughly doubled relative to GPT-3, and hallucination rates on closed-domain summarisation tasks dropped from around 41% to around 21% [src_033].
Two things follow from those numbers. First, on the deployment-relevant metric — does the model do what users actually ask — instruction tuning plus RLHF is a much higher-leverage axis than parameter count, at least once you are in the regime where the base model has the relevant knowledge. The corollary, which Ouyang et al. state explicitly, is that the alignment pipeline is not a luxury added on top of a finished model; it is a substantial part of what makes the model useful at all [src_033]. Second, and more practically, the released text-davinci-002 (the SFT-only variant with some additional tuning) and text-davinci-003 (the full PPO-aligned variant) were the production deployments of this pipeline, and ChatGPT was, to a first approximation, a chat-tuned descendant of the same pipeline [src_005]. The InstructGPT recipe — SFT, then RM, then PPO with KL-to-reference — became the default post-training stack for roughly two years, and every alternative discussed in Chapters 12 and 13 is most clearly understood as a critique of one specific component of it.
💡 Key result
On the OpenAI API prompt distribution, outputs from the 1.3B PPO-aligned InstructGPT model were preferred to outputs from the 175B raw GPT-3 base — algorithm beat scale by roughly two orders of magnitude in parameter count.
9. Where this is heading¶
This chapter has presented the canonical pipeline as Ouyang et al. wrote it down. The two chapters that follow take the same skeleton and modify it in two different places.
Chapter 12 (DPO) eliminates the explicit reward model and the RL loop entirely. The observation is that the KL-constrained reward maximisation objective \(J(\pi_\theta)\) in §6 has a closed-form optimal policy, and inverting that closed-form expression rewrites the reward as a log-ratio of the policy against the reference. The Bradley-Terry preference loss, expressed in this implicit-reward parameterisation, becomes a binary cross-entropy classification loss on the policy itself — no reward model, no PPO, no critic. DPO trades the engineering complexity of online RL for the inflexibility of offline preference data, and the trade-off was the right one for many production teams in 2023–2024 [src_005].
🔗 Connection
Chapter 12 (DPO) inverts the closed-form optimum of \(J(\pi_\theta)\) from §6, eliminates the reward model and the PPO loop, and replaces them with a single binary cross-entropy classification loss on the policy itself.
Chapter 13 (GRPO and verifiable rewards) attacks a different component: the human-preference reward model. For tasks where correctness is mechanically checkable — mathematics with a known answer, code with a passing test suite — the reward model can be replaced by a programmatic checker that returns an exact zero-or-one signal. With a verifiable reward in hand, reward hacking against learned-classifier artifacts goes away by construction, and a critic-free policy-gradient algorithm (GRPO) replaces PPO. This is the recipe behind DeepSeek-R1 and the broader 2025 reasoning-model wave [src_005].
🔗 Connection
Chapter 13 (GRPO and RLVR) replaces the learned reward model with a programmatic checker that returns an exact zero-or-one signal; the critic-free GRPO algorithm replaces PPO and is the recipe behind DeepSeek-R1 and the broader 2025 reasoning-model wave.
RLHF is not the end of the story. It is the historical baseline against which both DPO and GRPO are written, and the vocabulary it introduces — reference policy, KL constraint, Bradley-Terry preferences, reward over-optimisation — is the vocabulary the rest of this part is built on.
🔄 Recap
- Complete the equation: \(J(\pi_\theta) = \mathbb{E}\bigl[r_\phi(x, y)\bigr] - \beta \cdot \mathbb{E}\bigl[\text{?}\bigr]\) — what does \(\beta\) multiply, and why is it called the KL constraint?
- Explain in your own words what makes a policy a reference policy, and why \(\pi_{\text{SFT}}\) is the natural choice in this chapter.
- Predict, given the Bradley-Terry assumption \(P(y_w \succ y_l \mid x) = \sigma(r(x, y_w) - r(x, y_l))\), what happens to that probability when the reward gap is zero — and what that implies about the labeler's behaviour at zero gap.
- Compare reward hacking and reward over-optimisation: which name does each section (§6 versus §7) prefer, and what underlying phenomenon do both refer to?
References¶
- [src_002] Tong Xiao and Jingbo Zhu. Foundations of Large Language Models (arXiv:2501.09223v2), 2025. https://arxiv.org/pdf/2501.09223
- [src_005] Nathan Lambert. RLHF Book, v8, 2026. https://rlhfbook.com/
- [src_010] Sebastian Raschka. Build a Large Language Model (From Scratch). Manning, 2024. https://github.com/rasbt/LLMs-from-scratch
- [src_033] Long Ouyang, Jeff Wu, Xu Jiang, et al. Training Language Models to Follow Instructions with Human Feedback (InstructGPT), 2022. https://arxiv.org/pdf/2203.02155
- [src_034] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model, 2023. https://arxiv.org/pdf/2305.18290