
Reasoning Models and Verifiable Rewards

1. The field is moving while we write

This chapter is the closing one in the alignment part, and it is the one most likely to be wrong by the time it is read. The two preceding chapters covered material that has stabilised: supervised fine-tuning followed by preference learning is the canonical post-training pipeline, and Direct Preference Optimization is the simplification of that pipeline that the open-source community converged on through 2024 and 2025 [src_005]. Reasoning is not yet at that level of consensus. Group Relative Policy Optimization (GRPO), introduced in early 2024 by Shao and colleagues at DeepSeek, plus the Reinforcement Learning from Verifiable Rewards (RLVR) paradigm popularised by DeepSeek-R1 in early 2025, currently look like the right answer for training models that can solve mathematics and programming problems by producing long chains of intermediate reasoning [src_032, src_035]. They might not look like the right answer in twelve months. The reader who comes to this chapter in late 2026 should expect either an updated revision in the repository or, at minimum, a section flagging what has changed. Lambert's RLHF Book, which we treat as the up-to-date reference for this entire part of the volume, makes the same caveat explicitly: the consensus on which preference-learning or RL algorithm to use is still forming, and recommendations are revised across editions [src_005].

What this chapter does try to be stable about is the conceptual scaffolding. There are two distinct ideas grouped under the label "chain-of-thought" (CoT), and conflating them leaks into bad intuition about what reasoning models are. There is a particular optimization algorithm — GRPO — whose central trick (a group baseline replacing a learned value function) is simple enough to write down in one paragraph and is unlikely to disappear even if its current name does. There is a particular reward-design choice — RLVR — whose scope (math, code, anything with a programmatic checker) and limits (open-ended writing, factuality on open-domain queries, anything where humans cannot quickly agree on a ground-truth answer) are clearer than the optimization details. Section 2 treats the two flavours of CoT, sections 3 through 6 treat GRPO and RLVR as algorithmic and reward components, section 7 treats test-time compute as a scaling axis distinct from the training-compute scaling laws of Chapter 9, and section 8 returns to the field-in-flux frame with a list of honest caveats. Lilian Weng's "Why We Think" is the supplementary essay that surveys the broader landscape this chapter only samples, and we recommend it as further reading for any reader who wants the wider picture [src_044].

2. Two flavours of chain-of-thought

The phrase "chain-of-thought" has come to mean two different things, and we need to keep them separate.

The first meaning, due to Wei and colleagues in 2022, is prompted CoT: the user supplies the model with a few-shot prompt whose exemplars include not just question-answer pairs but question-rationale-answer triples, where the rationale is a worked-out sequence of intermediate reasoning steps in natural language [src_036]. At inference time the model continues the pattern, producing its own rationale before its answer. Wei and colleagues showed that this simple change in prompting dramatically improves performance on arithmetic, commonsense, and symbolic reasoning benchmarks for sufficiently large models — most notably PaLM 540B on GSM8K, where prompted CoT achieved a state-of-the-art score that surpassed the previous best from finetuned GPT-3 with a verifier [src_036]. The model itself was unchanged. The improvement came entirely from changing the input distribution at inference time.
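To make the mechanics concrete, here is the shape of such a prompt. The exemplar below is our own illustration, not a verbatim example from the paper, but the question-rationale-answer structure is the one Wei and colleagues describe:

Q: A library has 48 books on a shelf. 17 are checked out and 9 new books arrive. How many books are on the shelf now?
A: The shelf starts with 48 books. After 17 are checked out, 48 - 17 = 31 remain. Then 9 arrive, so 31 + 9 = 40. The answer is 40.

Q: {the user's actual question}
A:

The model continues from the final "A:", imitating the exemplar by writing its own rationale before committing to an answer.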

The second meaning, introduced by DeepSeek-R1 in early 2025 (and by OpenAI's o1 around the same time, though with no public paper to cite), is RL-trained CoT: the model is trained to produce long reasoning traces, with reinforcement learning rewarding completions that arrive at correct answers [src_032]. The reasoning is no longer elicited; it is a learned behaviour. The model has been incentivised to spend more tokens on intermediate work, to interrupt itself, to verify its own steps, and to backtrack when a path stops looking promising. DeepSeek-R1's authors describe the resulting behaviour as encompassing self-reflection, verification, and dynamic strategy adaptation, and they are explicit that the RL framework is what produces these behaviours, not a clever prompt [src_032].

The two paradigms differ on where the reasoning quality comes from and how it is checked. Prompted CoT relies on the model's pretraining distribution: the rationale the model generates is whatever its language-modelling prior thinks is plausible given the few-shot exemplars, and there is no signal during inference that distinguishes a sound chain of inferences from a fluent but wrong one. RL-trained CoT, by contrast, has a reward signal that touches the reasoning indirectly through its outcome: if the final answer is wrong, every step of the rationale that produced it is implicated and gets discouraged across enough training. The two paradigms are not exclusive — DeepSeek-R1's released production model uses both — but they are pedagogically distinct, and the rest of this chapter is about the second one.

💡 Key result

Prompted CoT and RL-trained CoT are not the same paradigm — one changes the input distribution at inference; the other changes the weights.

3. What Wei 2022 left open

Wei and colleagues' 2022 paper is the historical anchor for chain-of-thought as a technique, and it remains the right citation for the prompted form. Three points from that paper carry into the modern reasoning-model story.

First, prompted CoT only worked at scale. Wei and colleagues observed that the gains over standard prompting were emergent: small models did not benefit, and in some cases the rationales they produced were less accurate than direct answers. Only at sufficiently large model size — for the arithmetic benchmarks they studied, around PaLM 540B and GPT-3 175B — did the technique produce substantial improvements [src_036]. The interpretation in the 2022 paper was that intermediate reasoning is itself a capability that scales with model size, and that prompting the model to use it is what unlocks it. The framing has been productively contested. Schaeffer, Miranda, and Koyejo (2023) argue that the apparent sharpness and unpredictability of emergent abilities is largely a metric artefact: nonlinear or discontinuous metrics like exact-match accuracy produce the appearance of phase transitions, whereas linear or continuous metrics over the same outputs reveal smooth, predictable scaling [src_056].

🎯 Intuition

Imagine grading multi-step arithmetic by exact-match: a model that gets the answer right \(0\%\) of the time at one scale and \(30\%\) at the next looks like a phase transition — capability switches on at scale. Score the same outputs by token-level edit distance and the curve is smooth: the "jump" was the binarising metric, not the underlying capability. Schaeffer's claim is that much of the emergent-ability literature is reading the metric as the phenomenon.

Their critique does not eliminate the practical effect — at the scales where prompted CoT does help, it helps a lot — but it complicates the "emergent at scale" reading and is the right counterweight to keep in mind when reading the 2022 result.
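A toy calculation, ours rather than either paper's, shows how the metric alone can manufacture the jump. Assume per-token accuracy improves smoothly with scale and that an answer counts as correct only if all of its tokens are right; the exact-match curve is then the per-token curve raised to a power, which hugs zero before appearing to switch on:

# Toy illustration (not data from either paper): a smooth per-token metric versus
# a sharp exact-match metric computed from the same underlying capability.
n_tokens = 10                                     # answer length, in tokens
per_token_acc = [0.30, 0.45, 0.60, 0.72, 0.82, 0.90, 0.95]  # smooth improvement with scale

for step, p in enumerate(per_token_acc):
    exact_match = p ** n_tokens                   # ~0 for small p, then "switches on"
    print(f"scale step {step}: per-token {p:.2f} -> exact-match {exact_match:.3f}")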

Second, the reasoning was unverified. A chain of thought is, in Wei's framing, a window into the model's behaviour, but the authors were careful to note that fully characterising whether a given chain actually supports the answer remains open [src_036]. There was no mechanism, in the prompted-CoT paradigm, to push back against rationales that were superficially fluent but factually wrong. The 2022 paper's appendix contains examples of the model arriving at correct answers through reasoning steps that were themselves incorrect, and conversely arriving at wrong answers through reasoning steps that looked locally plausible. Without an outcome signal to grade the chain, this is a feature of the paradigm, not a bug to be fixed within it.

Third, prompted CoT does not change the model. Once the paper's experiments have been run, the underlying language model is exactly the same set of weights it started with. There is no learning. This matters for the framing of section 4 onward: the move from "elicit reasoning at inference time" to "train reasoning into the weights" is not a refinement of CoT, it is a different paradigm with different leverage. The 2022 paper opened the question; it did not propose a way to close the loop between rationale quality and outcome correctness.

4. Group Relative Policy Optimization

GRPO is the algorithm that powers the modern reasoning-model recipe, introduced by Shao and colleagues at DeepSeek in February 2024 in the DeepSeekMath paper [src_035]. It is presented as a variant of Proximal Policy Optimization (PPO) that drops the value-function critic, and the entire algorithmic content can be stated in four lines.

For each prompt \(x\) drawn from a dataset \(\mathcal{D}\), sample a group of \(K\) completions \(\{y_1, \ldots, y_K\}\) from the current policy \(\pi_{\theta_\text{old}}\). Compute a per-completion reward \(r_i = r(x, y_i)\) for each — what the reward function is, we leave open for now and return to in section 5. Form the per-completion advantage as the within-group standardised reward,

\[A_i = \frac{r_i - \mathrm{mean}(r_1, \ldots, r_K)}{\mathrm{std}(r_1, \ldots, r_K)},\]

⚠️ Pitfall

The standardisation has two independent moving parts. The centering (\(r_i - \mathrm{mean}\)) is the actual baseline — it is what subtracts off the prompt-level reward variance the critic was supposed to absorb. The scaling (divide by \(\mathrm{std}\)) is what gives the advantage invariance to differences in reward magnitude across prompts. Some implementations keep only the centering and drop the scaling; the chapter (and the canonical DeepSeek-R1 form) uses both.

and apply a PPO-style clipped surrogate objective using these advantages, with a KL penalty against a fixed reference policy \(\pi_\text{ref}\):

🎯 Intuition

The \(\mathrm{clip}\) keeps the policy from moving too far per update by zeroing out the gradient when the importance ratio \(\rho_i\) leaves \([1-\varepsilon, 1+\varepsilon]\). Without it, a single rollout with a large advantage can drive the policy off-distribution; the clip is what makes the surrogate an honest local approximation to the true policy gradient.

\[\mathcal{J}_\text{GRPO}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\} \sim \pi_{\theta_\text{old}}}\!\left[\frac{1}{K}\sum_{i=1}^{K} \min\!\left(\rho_i(\theta) A_i,\; \mathrm{clip}(\rho_i(\theta), 1-\varepsilon, 1+\varepsilon) A_i\right) - \beta\, D_\text{KL}\!\left(\pi_\theta \,\|\, \pi_\text{ref}\right)\right],\]

where the importance ratio is \(\rho_i(\theta) = \pi_\theta(y_i \mid x) / \pi_{\theta_\text{old}}(y_i \mid x)\), the clip parameter \(\varepsilon\) controls how far the policy can move per update, and the KL coefficient \(\beta\) controls how far the policy can drift from the reference [src_032, src_035]. This is the form written in DeepSeek-R1's paper as Equation 1 and in the DeepSeekMath paper that introduced GRPO as Equation 3 of its corresponding section [src_032, src_035].
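A worked example with made-up numbers shows what the group baseline does in the binary-reward case that section 5 will introduce. Take \(K = 4\) and rewards \((1, 0, 0, 1)\): the mean is \(0.5\), the standard deviation is \(0.5\), and the advantages are \((+1, -1, -1, +1)\), so the two correct completions are pushed up and the two incorrect ones pushed down by equal amounts. If all four completions fail, the rewards are \((0, 0, 0, 0)\), every advantage is zero (the small constant added to the denominator in practice prevents a division by zero), and the group contributes no policy-gradient signal at all, a degenerate case the DAPO discussion in section 8 returns to.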

🤔 Pause and reflect

What changes in the equation if the standardisation \(A_i = (r_i - \mathrm{mean})/\mathrm{std}\) is replaced by the simpler \(A_i = r_i - \mathrm{mean}\)? Which of GRPO's properties does the \(\mathrm{std}\) term carry, and which are unaffected? (Predict before reading the contrast paragraph.)

🎯 Intuition

Subtracting any function of the prompt from the rewards leaves the policy gradient's expectation unchanged but reduces its variance — that is the standard control-variate argument. The within-group mean \(\mathrm{mean}(r_1, \ldots, r_K)\) is the simplest such function, and it adapts automatically to each prompt's reward scale. GRPO's central trick is precisely this substitution: a critic-free baseline that is computed, not learned.
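For completeness, the one-line version of that argument (standard policy-gradient material rather than anything specific to GRPO): for any baseline \(b(x)\) that does not depend on the sampled completion,

\[\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[b(x)\, \nabla_\theta \log \pi_\theta(y \mid x)\right] = b(x) \sum_{y} \pi_\theta(y \mid x)\, \nabla_\theta \log \pi_\theta(y \mid x) = b(x)\, \nabla_\theta \sum_{y} \pi_\theta(y \mid x) = b(x)\, \nabla_\theta 1 = 0,\]

so subtracting it changes the variance of the gradient estimate but not its expectation. Strictly, the group mean is not quite such a baseline, because \(y_i\) is one of the \(K\) samples used to compute it; the exact argument applies to a leave-one-out mean, and the residual effect of including the own sample shrinks as \(K\) grows.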

The contrast with PPO is the one detail worth dwelling on. In standard PPO, the advantage \(A_i\) is computed using a separately learned value function \(V_\phi(x)\) as \(A_i \approx r_i - V_\phi(x)\) (or as a generalised advantage estimate, GAE — a multi-step bootstrapped baseline that trades bias for variance, used in PPO's canonical implementation). The value function is itself a model, typically the same size as the policy, that has to be trained alongside the policy and held in memory. Shao and colleagues motivated GRPO precisely as a way to drop this critic: the group mean \(\mathrm{mean}(r_1, \ldots, r_K)\) acts as a prompt-conditional baseline that subtracts off the same prompt-level reward variance the critic was supposed to absorb, but without any neural network to train [src_035]. The standardisation by \(\mathrm{std}(r_1, \ldots, r_K)\) is a variance-reduction trick that makes the advantage scale-invariant to differences in reward magnitude across prompts. The DeepSeekMath paper reports that this change roughly halves the memory consumption of training compared to PPO at a comparable batch size, because the value network is gone [src_035]. DeepSeek-R1's authors carry the same justification forward: GRPO was adopted to simplify the training process and reduce the resource consumption of PPO, which is widely used in the RL stage of LLMs [src_032].

🔗 Connection

Chapter 11 (From SFT to RLHF) derived the PPO clipped surrogate and introduced the value-function critic as the per-prompt baseline. This chapter inherits the clip and replaces the critic with the within-group sample baseline; everything else in the §4 objective is the Ch.11 PPO machinery.

A reference implementation of the inner loop, in Python-like pseudocode, takes about a dozen lines:

# Per training step.
batch = sample_prompts(D, batch_size=N)
for x in batch:
    # 1. Sample K completions from the current (old) policy.
    ys = [policy_old.generate(x) for _ in range(K)]

    # 2. Score each completion with the reward function.
    rs = [reward_fn(x, y) for y in ys]

    # 3. Form within-group advantages.
    r_mean = mean(rs)
    r_std = std(rs) + 1e-8
    As = [(r - r_mean) / r_std for r in rs]

    # 4. PPO-style clipped surrogate + KL penalty.
    for y_i, A_i in zip(ys, As):
        # Importance ratio, computed in log space for numerical stability;
        # no gradient flows through the old policy's log-probability.
        rho = (policy.logprob(y_i, x) - policy_old.logprob(y_i, x).detach()).exp()
        loss_i = -min(rho * A_i, clip(rho, 1 - eps, 1 + eps) * A_i)
        loss_i += beta * kl(policy, policy_ref, x, y_i)

The structure is recognisably PPO. The only line that has changed is the advantage computation, which is now a within-group statistic rather than a critic call. In production code, the K completions per prompt are batched together, the KL term is approximated using the estimator from DeepSeek-R1's paper (the K3 estimator \(r - \log r - 1\) for \(r = \pi_\text{ref}/\pi_\theta\), which is unbiased, non-negative, and has lower variance than the naive log-ratio), and the reference policy is periodically refreshed to track the current policy — DeepSeek-R1-Zero refreshes the reference every 400 steps [src_032]. None of these production details change the conceptual content of the algorithm.
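A sketch of that KL term, in the same pseudocode register as above (the function name and the tensor interface are ours; the algebra is the \(r - \log r - 1\) form):

# Per-token KL penalty via the k3 estimator, k3 = r - log(r) - 1 with r = pi_ref / pi_theta.
# Inputs are per-token log-probabilities of the sampled completion under each model;
# names and shapes are illustrative, not tied to any particular codebase.
def kl_k3(logp_theta, logp_ref):
    log_r = logp_ref - logp_theta        # log(pi_ref / pi_theta), one entry per token
    r = log_r.exp()
    return (r - log_r - 1).mean()        # non-negative by construction; mean over tokens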

💡 Key result

GRPO is PPO with the value-function critic replaced by a within-group sample baseline.

🔄 Recap

  • Complete the equation. Write down the GRPO advantage \(A_i\) in terms of the per-completion rewards \(r_1, \ldots, r_K\).
  • Explain why GRPO halves training memory. Which network in the PPO setup does GRPO drop, and what does the within-group baseline replace it with?
  • Compare GRPO and PPO at the level of network components. Name the four networks held in memory during PPO training, and which of them survives in GRPO.

5. Reinforcement Learning from Verifiable Rewards

GRPO does not, on its own, say where the reward \(r(x, y)\) comes from. In classical RLHF, as covered in Chapter 11, the reward is the output of a learned reward model trained on human preference data: the reward model is a second large network, the policy is trained against its scalar output, and the well-known failure mode is reward hacking, where the policy finds completions that score high under the reward model but are not actually preferred by humans. Maintaining a separate reward model is also expensive — a second large network in memory and a second training pipeline to keep current.

🔗 Connection

The Bradley-Terry reward model and the PPO loop that uses it were derived in Chapter 11 (From SFT to RLHF). The contrast in this section depends on holding that derivation in mind: RLVR replaces the learned scalar reward with a programmatic one, leaving the rest of the RL loop intact.

RLVR — Reinforcement Learning from Verifiable Rewards — proposes a sharper alternative. When the task admits a programmatic check, no learned reward model is necessary. For mathematics with deterministic answers, the model is required to produce its final answer in a specified format, typically inside a box, and the reward is binary: matches ground truth, or does not [src_032]. For competitive programming, a compiler executes the model's submitted code against a suite of unit tests, and the reward is again binary, or graded by the fraction of tests passed [src_032]. The reward function is a piece of code, not a learned model.

DeepSeek-R1's authors describe the RLVR reward design used in DeepSeek-R1-Zero precisely. The reward is a sum of two rule-based components: an accuracy reward, which evaluates whether the response contains a correct final answer in the specified format, and a format reward, which incentivises the model to encapsulate its reasoning inside <think>...</think> tags and its answer inside <answer>...</answer> tags [src_032]. There is no neural reward model anywhere in DeepSeek-R1-Zero. The full reward is

\[\text{Reward}_\text{rule} = \text{Reward}_\text{accuracy} + \text{Reward}_\text{format},\]

🤔 Pause and reflect

Given \(\text{Reward}_\text{rule} = \text{Reward}_\text{accuracy} + \text{Reward}_\text{format}\), what could a model learn to optimise that would score high under this reward without producing sound reasoning? Sketch one or two failure modes before reading on. (Section 8's fourth bullet returns to this.)

with the two components combined at equal weight [src_032]. The authors are explicit about why they chose this design: neural reward models, in their experience, are susceptible to reward hacking during large-scale reinforcement learning, and retraining them adds substantial compute and pipeline complexity [src_032]. A rule-based reward sidesteps both problems. It cannot be hacked in the conventional sense, because there is no learned signal to game — the only way to score high is to be right.
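A minimal sketch of such a rule-based reward, assuming exact-string matching on the extracted answer and the <think>/<answer> tag format described above (the regex, the matching rule, and the function signature are illustrative choices of ours, not DeepSeek's published implementation):

import re

# Illustrative two-component rule-based reward: format + accuracy, combined at equal weight.
# The tag structure follows the paper's description; everything else is an assumption.
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def rule_based_reward(completion, ground_truth):
    match = FORMAT_RE.search(completion)
    format_reward = 1.0 if match else 0.0                              # reasoning and answer in tags?
    answer = match.group(1).strip() if match else ""
    accuracy_reward = 1.0 if answer == ground_truth.strip() else 0.0   # exact match on the final answer
    return accuracy_reward + format_reward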

The combination of GRPO with RLVR is, then, the modern reasoning-model recipe in its purest form: sample a group of completions from the current policy, score each with a programmatic check, take the within-group standardised reward as the advantage, and update the policy with PPO's clipped surrogate. No reward model. No critic. No human-labelled preferences. The training-time picture is closer to AlphaZero's self-play (the regime in which a model generates its own training data by playing against itself, with outcomes scored programmatically — the structural analogue of GRPO's K-rollout group plus verifiable reward) than to the classical RLHF pipeline of Chapter 11.

💡 Key result

GRPO + RLVR removes both the critic and the reward model from the RL stage, leaving only a policy and a programmatic check.

🔄 Recap

  • Explain in your own words why RLVR sidesteps reward hacking. Why does the absence of a learned reward model make the conventional reward-hacking failure mode unavailable?
  • Predict. Which of the following task families admit RLVR, and why: (a) competitive programming; (b) summarisation; (c) drug-target interaction prediction with a wet-lab readout; (d) ranking news headlines by relevance to a user query?
  • Complete. GRPO + RLVR removes both ___ and ___ from the RL stage.

6. DeepSeek-R1 — what the production recipe looks like

DeepSeek-R1-Zero is the cleanest illustration of pure RLVR: it starts from the DeepSeek-V3 base model and applies GRPO with rule-based rewards directly, with no supervised fine-tuning before the RL phase [src_032]. The result is striking. On AIME 2024, DeepSeek-R1-Zero's pass@1 score climbs from \(15.6\%\) at the start of training to \(77.9\%\) by the end, and self-consistency over 16 samples (the inference-time technique of sampling \(K\) independent reasoning chains from the same prompt and taking the majority answer) lifts it to \(86.7\%\) — which exceeds the average score of human participants in the AIME competition [src_032]. The average response length grows in parallel with accuracy: the model learns to spend more tokens on its reasoning over the course of training, without any explicit instruction to do so [src_032]. DeepSeek-R1's authors highlight one striking transition during this trajectory, which they call the "aha moment": at a particular point in training, the model spontaneously develops a habit of interrupting its own reasoning with the word "wait" or its analogues, going back to reconsider an earlier step, and rerunning the calculation [src_032]. The aha moment is presented as an emergent behaviour induced by RL pressure, not a prompted one.
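Self-consistency itself is only a few lines. A sketch, assuming a model.generate call that draws an independent sample each time and an extract_answer helper that parses the final answer out of a completion (both names are placeholders of ours):

from collections import Counter

# Self-consistency: sample K independent reasoning chains, return the majority answer.
def self_consistency(model, prompt, k=16):
    answers = [extract_answer(model.generate(prompt)) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]   # most frequent final answer wins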

Pure RLVR, however, has costs that show up in everything except mathematics and code. DeepSeek-R1-Zero's outputs suffer from poor readability and language mixing — the model occasionally combines English and Chinese within a single chain-of-thought response — and its capability outside the verifiable domains is limited [src_032]. DeepSeek-R1, the released variant, addresses these issues with a multi-stage pipeline that adds supervised fine-tuning on top of the RLVR core: (1) a "cold-start" stage in which the V3 base model is supervised-fine-tuned on a small set of high-quality long-form CoT examples; (2) an RL stage that uses GRPO with rule-based rewards plus a language-consistency reward; (3) a second round of SFT on a larger reasoning dataset generated from the stage-(2) model via rejection sampling (generating many candidate completions and keeping only those that pass a quality filter — here, a verifier on the math-and-code traces), combined with non-reasoning data (writing, factuality, role-play); and (4) a second RL stage that adds a learned preference reward for helpfulness and harmlessness [src_032]. The released DeepSeek-R1 inherits R1-Zero's reasoning behaviour but produces cleaner, more readable text and generalises better to non-reasoning tasks.
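The rejection-sampling step in stage (3) is, at its core, a filter over the model's own samples. A sketch under placeholder names (verify stands in for the math-and-code checker; the per-prompt sample count is an illustrative choice of ours):

# Rejection sampling for the second SFT round: generate several candidates per prompt
# and keep only the (prompt, completion) pairs that pass the verifier.
def build_reasoning_sft_dataset(model, prompts, verify, samples_per_prompt=16):
    dataset = []
    for x in prompts:
        for _ in range(samples_per_prompt):
            y = model.generate(x)
            if verify(x, y):                        # e.g. final answer correct, tests pass
                dataset.append((x, y))
    return dataset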

From the production pipeline to distillation

This staged pipeline is also the recipe that lets DeepSeek-R1 be distilled into smaller models. The team uses the trained R1 to generate reasoning traces, and then supervised-fine-tunes smaller open-weight models — ranging from 1.5B to 70B parameters — on those traces [src_032]. The distilled models retain a substantial fraction of R1's reasoning ability without needing the full RL pipeline themselves, an instance of the same pattern Chapter 9 met in its discussion of scaling laws: capability that lives in a large model can sometimes be transferred to a smaller one through a curated training set, even if the smaller model could not have learned the capability from scratch.

The distillation result is also the practical reason DeepSeek-R1 mattered in 2025 specifically: it was the first open-weights reasoning model that approached the quality of OpenAI's o1 series on the standard math and code benchmarks, and the recipe — including the GRPO algorithm and the RLVR reward design — was published with enough detail to be reproduced. Subsequent open-weights reasoning models have followed variants of this template.

7. Test-time compute as a scaling axis

🔗 Connection

Chapter 9 (Scaling Laws) fixed two axes — parameters \(N\) and tokens \(D\) — at fixed training compute \(C\), with the Chinchilla allocation telling teams how to divide a budget. This section adds a third axis at a different point in the model lifecycle: tokens spent at inference. The Chinchilla framework still holds for pretraining; test-time compute extends it.

There is one more conceptual move this chapter needs to make explicit. Chapter 9 discussed scaling laws as relationships between training compute, dataset size, and model capability — the Chinchilla-style picture in which a fixed compute budget is best spent on a particular ratio of parameters and tokens. Reasoning models add a third axis: the compute spent at inference time, in the form of how many tokens of reasoning the model is allowed to produce before it commits to an answer.

The empirical observation from DeepSeek-R1-Zero's training run is that this axis is real. As the AIME pass@1 score climbed from \(15.6\%\) to \(77.9\%\) over the course of GRPO training, the average response length grew in lockstep — the model learned, under the RL signal, that spending more tokens on intermediate reasoning was a winning strategy [src_032]. The training dynamics did not just produce a better fixed-compute model; they produced a model that uses inference-time compute differently. For a question that does not need much reasoning, the model produces a short answer; for a hard problem, it produces a long chain of thought, sometimes thousands of tokens long, often with backtracking and self-verification [src_032].

🤔 Pause and reflect

During R1-Zero training, AIME pass@1 climbs in lockstep with average response length. Does the longer reasoning cause the higher accuracy, or do they merely correlate — both produced by some third factor in the RL signal? What would distinguish the two readings? (Predict before reading on.)

🎯 Intuition

The §6 observation — that R1-Zero spends short answers on easy questions and long ones on hard ones — is what licenses the leap from "co-growth during training" to "inference-time compute is a scaling axis". The model has learned to allocate compute per query, not just on average across training. That is the difference between a fixed-compute model that happens to be better and a model that uses its compute differently.

Lilian Weng's "Why We Think" essay surveys this development in broader terms, framing test-time compute as an additional resource the model can be trained to allocate, and connecting it to a broader literature on adaptive computation and latent-variable inference [src_044]. The essay is not a primary source for any single technical claim in this chapter, but it is the most comprehensive contemporary survey of the test-time-compute idea in the post-R1 era, and we recommend it as further reading for readers who want the wider context. The point worth taking from it is conceptual: training-time compute and inference-time compute are distinct levers, and reasoning models are the regime in which the second one started to matter as much as the first.

💡 Key result

Training-time compute and inference-time compute are distinct levers, and reasoning models are the regime in which the second matters as much as the first.

🔄 Recap

  • Complete. Reasoning models add a third axis to the Chinchilla framework: ___ compute, paid in ___.
  • Compare. What is the difference between "a model that uses more inference compute" and "a model that uses inference compute differently"? Which one does R1-Zero exhibit?
  • Predict. If a model is trained with a fixed maximum response length cap, would the §7 axis be visible in its training dynamics? Why or why not?

8. Honest caveats

Section 1 framed this chapter as the most fluid topic in the book. Section 8 closes it by being explicit about the open questions, in six short bullets.

First, the algorithm question is not settled. As of the cut-off date of this volume, in April 2026, Lambert's RLHF Book characterises the operational consensus as: use DPO when the data you have is offline pairwise preference data and training simplicity is paramount, use GRPO with RLVR when the task admits a verifiable reward and online sampling is feasible, and use classical PPO when neither of those conditions holds and you have engineering resources to spare [src_005]. This consensus is provisional. New optimizers, new reward designs, and new theoretical results are being published at a pace that exceeds any printed book's revision cycle. The framework in this chapter — GRPO as the optimization, RLVR as the reward design — should be read as the snapshot of best practice in early 2026, not as a prediction about what 2027 will look like.

Second, RLVR's scope is narrower than reasoning's scope. RLVR works well precisely where the task admits a programmatic checker: mathematics with deterministic answers, competitive programming with executable test cases, formal-logic problems with verifiable proofs, certain physics and chemistry problems where the answer reduces to a number. It is not yet clear whether the same paradigm extends to less verifiable domains. Open-ended writing, the helpfulness of a dialogue response, the factuality of an answer about a current event, the soundness of a legal argument, the appropriateness of a medical recommendation — none of these admit a binary check that can be embedded in a reward function. DeepSeek-R1's full pipeline already concedes this point implicitly: its post-RLVR stages reintroduce a learned reward model for helpfulness and harmlessness, because rule-based rewards do not cover those dimensions [src_032]. Extending verifiable-reward training to broader task families is one of the most active research questions in the field, and the consensus is that it is not solved [src_005].

Conceptual hazards in the label and in the reward

Third, "reasoning" is doing a lot of work as a label. What current benchmarks measure under the heading of reasoning — performance on math contest problems, on programming challenges, on multi-step question answering with a deterministic answer — is genuine progress, but it is not the same thing as reasoning in the broader philosophical sense, and the gap between the two is something the field is still debating. A model that has learned to produce longer chains of memorised computational rules might score very well on AIME without doing anything that a philosopher would recognise as inference. Lilian Weng's essay surveys several of the open questions this raises about whether the model's verbalised reasoning faithfully reflects whatever computation is actually producing the answer [src_044]. The chapter takes no position on these debates beyond naming them; the reader should be aware that the word "reasoning" in "reasoning model" is a term of art, not a philosophical claim.

Fourth, reward hacking has not gone away. RLVR sidesteps the classical reward-model-hacking failure mode, but it introduces new ones. A model that learns to format its answer correctly while producing reasoning that does not actually justify the answer is still scoring high under an RLVR signal, because RLVR rewards outcomes, not the soundness of intermediate steps. Spurious correlations — a model learning that certain formatting patterns or particular phrasings lead to answers being marked correct, independent of whether the reasoning is sound — appear in practice [src_005]. The RLHF Book's discussion of this is unambiguous: the absence of a learned reward model does not eliminate misalignment between what is rewarded and what is wanted; it shifts the locus of misalignment to the design of the rule-based reward itself [src_005].

The three-method taxonomy

Fifth, the cross-chapter relationships between DPO, classical PPO, and GRPO are worth laying out explicitly, along three dimensions.

  • Reward signal. For DPO it is offline pairwise preferences (no reward model trained, no RL loop); for classical PPO it is the scalar output of a learned reward model trained on preferences; for GRPO with RLVR it is a programmatic check evaluated at training time on freshly sampled completions.
  • Critic and sampling. DPO requires no critic and no online sampling; PPO requires both a critic (the value network) and online sampling; GRPO requires online sampling but eliminates the critic via the group baseline.
  • Data. DPO needs paired preference data; PPO needs preference data plus prompts for the RL phase; GRPO with RLVR needs prompts plus a verifiable reward function and no human labels at all.

The three methods are not in fierce competition for the same regime — they fit different combinations of data availability and reward structure — and the right choice depends on which of those combinations describes the situation at hand [src_005, src_034, src_035].

🔗 Connection

DPO is the central topic of Chapter 12 (Direct Preference Optimization); it derives the implicit-reward identity that lets pairwise preferences be optimised without a reward model or an RL loop. The taxonomy here positions DPO at one corner (offline preference learning), classical PPO + reward model at another (online preference RL), and GRPO + RLVR at a third (online verifiable-reward RL).

One year of follow-up: corroboration and refinement

Sixth, the year since DeepSeek-R1 has filled in the picture without overturning it. Between January 2025 and April 2026 the field produced enough follow-up work to test the GRPO+RLVR thesis at multiple scales and from multiple labs, and the headline finding is that the recipe holds. Moonshot's Kimi k1.5, released the same week as DeepSeek-R1, reached comparable AIME and MATH-500 numbers using a different optimizer (online mirror descent) but the same outcome-only reward signal, which corroborates the central claim of section 5: where a verifiable reward is available, the choice of policy-gradient algorithm matters less than the choice to use one at all [src_061]. Alibaba's QwQ-32B (March 2025) reproduced R1-level math performance at 32B dense parameters with the same two-component rule-based reward, a useful data point for the distillation discussion in section 6 [src_064]. Anthropic's Claude 3.7 Sonnet (February 2025) introduced the "thinking budget" — a dial on a single model that selects how many tokens of internal reasoning to spend — which is the cleanest available exhibit of section 7's test-time-compute axis [src_065]. Alibaba's Qwen3 family (May 2025) ported a hybrid thinking/non-thinking single-weight-set design to open weights and reports strong performance against R1 across a broad benchmark set [src_057]. OpenAI's o3 and the subsequent GPT-5 thinking variant continued the closed-weights lineage from o1; neither is accompanied by a paper, so the public technical trail in this chapter still ends at DeepSeek-R1. On the algorithm side, DAPO (ByteDance Seed and Tsinghua, March 2025) is the most consequential refinement of GRPO so far: it diagnoses three concrete failure modes — entropy collapse from symmetric clipping, the down-weighting of long traces under sequence-mean loss, and zero-gradient batches when every rollout in a group either passes or fails — and patches each, with Clip-Higher, Dynamic Sampling, Token-level Policy Gradient Loss, and Overlong Reward Shaping respectively [src_062]. The three-method taxonomy from the fifth bullet (DPO / classical PPO / GRPO+RLVR) still holds; DAPO sits inside the GRPO cell as a refinement, and ByteDance's later VAPO (April 2025) sits inside the classical PPO cell as a counterargument that critics still earn their keep [src_063].

🔄 Recap

  • Compare. Which of §8's six caveats is structurally about the algorithm (open at the optimisation-method level), which is structurally about the reward design (open at the task-family level), and which is structurally about evaluation (the label "reasoning")?
  • Predict. §8 bullet 5 places GRPO+RLVR, classical PPO, and DPO in three different regimes. For each pair, name one situation in which one would be preferred over the other and one in which they are interchangeable.
  • Explain. §8 bullet 6 reports that Kimi k1.5 reaches comparable AIME numbers using online mirror descent rather than PPO-clipped surrogates. What does that fact corroborate from §5, and what does it leave open from §4?

9. Where the book ends

This chapter is the last of the alignment part, and the alignment part is the last of the first cut of this book. The material that has made it into this first cut is the material that had stabilised enough by April 2026 to be written down with reasonable confidence. The material that has not yet made it into the book — diffusion and flow-matching generative models, multimodal models that extend the Transformer to vision and audio jointly, the foundations of reinforcement learning that this chapter assumed at speed, and mechanistic interpretability — is work in progress.

Reasoning models in particular are the topic in this volume most likely to look different a year from now. The reader who arrives at this chapter twelve months after publication should look for an updated version in the repository, or for a "what changed" note describing the specific claims that have been superseded. If neither exists, the reader is welcome to open an issue, or, in the spirit in which this book is being written, to send a pull request.

References

  • src_005 — Nathan Lambert. RLHF Book (v8). April 2026. https://rlhfbook.com/
  • src_032 — DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948, 2025. https://arxiv.org/pdf/2501.12948
  • src_034 — Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290, 2023. https://arxiv.org/abs/2305.18290
  • src_035 — Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300, 2024. https://arxiv.org/pdf/2402.03300
  • src_036 — Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022. https://arxiv.org/pdf/2201.11903
  • src_044 — Lilian Weng. Why We Think. Blog post, May 2025. https://lilianweng.github.io/posts/2025-05-01-thinking/
  • src_056 — Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are Emergent Abilities of Large Language Models a Mirage? In Advances in Neural Information Processing Systems 36 (NeurIPS 2023, Outstanding Paper), 2023. https://arxiv.org/pdf/2304.15004
  • src_057 — An Yang and the Qwen Team. Qwen3 Technical Report. arXiv:2505.09388, 2025. https://arxiv.org/pdf/2505.09388
  • src_061 — Kimi Team (Moonshot AI). Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv:2501.12599, 2025. https://arxiv.org/pdf/2501.12599
  • src_062 — Qiying Yu, Zheng Zhang, Ruofei Zhu, et al. DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv:2503.14476, 2025 (ByteDance Seed + Tsinghua). https://arxiv.org/pdf/2503.14476
  • src_063 — Yu Yue, Yufeng Yuan, Qiying Yu, et al. VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks. arXiv:2504.05118, 2025 (ByteDance Seed). https://arxiv.org/pdf/2504.05118
  • src_064 — Qwen Team. QwQ-32B: Embracing the Power of Reinforcement Learning. Blog post, March 2025. https://qwenlm.github.io/blog/qwq-32b/
  • src_065 — Anthropic. Claude 3.7 Sonnet and Claude Code. Announcement, February 2025. https://www.anthropic.com/news/claude-3-7-sonnet