Direct Preference Optimization¶
1. Why eliminate the reward model¶
Chapter 11 ended with a working RLHF recipe that, taken at face value, asks an alignment team to maintain four large language models at once: the supervised fine-tuned policy that started the whole thing, a reference copy of that policy frozen for KL purposes, a learned scalar reward model with its own training set and its own failure modes, and a value head that estimates expected return for the proximal-policy-optimization (PPO) advantage [src_005, src_034]. The recipe also requires sampling from the policy in the inner loop of training, which couples optimization to the model's current generation distribution and adds the variance of stochastic-gradient policy gradients on top of all the usual large-language-model optimizer drama [src_034]. None of this is fundamental to the problem the team actually wants solved, which is to push the policy toward responses that humans prefer and away from responses they do not.
The Rafailov et al. paper that opened the post-PPO era is best read as a single question phrased mathematically [src_034]. The classical RLHF objective is a constrained optimization with a known closed-form solution. The reward model only exists in that closed form as the argument of an exponential. So if we already have the closed form in hand, do we really need to fit the reward model and then do reinforcement learning against it, or can we skip both stages by re-arranging the algebra? Direct Preference Optimization (DPO) is the answer to that question: a single binary cross-entropy loss on policy log-probability ratios that, under the same Bradley-Terry preference model and the same KL penalty as PPO-RLHF, optimizes the same population-level objective but does so without ever instantiating an explicit reward model and without an inner reinforcement-learning loop [src_034]. The paper's subtitle, "your language model is secretly a reward model", names the trick: in the right parameterization, the policy log-ratio against the reference is the implicit reward [src_034].
This chapter walks through the derivation in five steps, then gives a runnable PyTorch sketch, then surveys the family of variants the community produced after the original paper, then closes with practical advice on when DPO is the right tool and when it is not.
2. The KL-constrained policy optimum¶
We start from the same RLHF objective Chapter 11 ended on. Let \(x\) denote a prompt and \(y\) a full response (a sequence of tokens). Let \(\pi_\text{ref}\) denote the reference policy (typically the supervised fine-tuned model), let \(\pi\) denote the policy we are optimizing, and let \(r(x,y)\) denote a hypothetical ground-truth reward.
🎯 Intuition
Picture the reference policy \(\pi_\text{ref}\) as a known "good enough" starting point and the optimizer as a controller pulling the policy toward labeler-preferred regions of response space, with the KL term as a leash tethering it to \(\pi_\text{ref}\). The coefficient \(\beta\) controls leash length: small \(\beta\) is a loose leash (the policy can roam far in pursuit of reward); large \(\beta\) is a stiff leash (it stays close to home). The objective that follows makes the leash precise, and its closed-form optimum a few lines further on describes exactly where that controller settles.
The KL-constrained reward maximization objective is
\[
\max_{\pi} \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big] \;-\; \beta\, \mathbb{D}_\text{KL}\big[ \pi(\cdot \mid x) \,\|\, \pi_\text{ref}(\cdot \mid x) \big],
\]
where \(\beta > 0\) controls how strongly we penalise drift from the reference [src_034]. Chapter 11 motivated the KL term as a guard against reward hacking: the learned reward \(r_\phi\) is unreliable on responses far from the SFT distribution, and the penalty keeps the policy in the regime where the reward model was trained to be accurate [src_005, src_034].
🔁 Connection
The KL-as-reward-hacking-guard argument and the canonical reward-hacking failure mode are derived in Chapter 11 (From SFT to RLHF). DPO inherits the KL term and its motivation; what changes here is that the reward model never has to be instantiated.
This objective has a closed-form optimum. Treating \(\pi(\cdot \mid x)\) as the variable, with the constraint that it is a probability distribution over \(y\), the Lagrangian has a single multiplier for the normalization constraint. Setting the functional derivative to zero and solving gives, for every prompt \(x\),
\[
\pi_r(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_\text{ref}(y \mid x) \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big),
\]
where \(Z(x) = \sum_{y} \pi_\text{ref}(y \mid x) \exp\big( \tfrac{1}{\beta}\, r(x, y) \big)\) is the per-prompt partition function that re-normalises the right-hand side [src_034]. The full derivation is one Lagrange multiplier and one rearrangement; Rafailov et al. record it in Appendix A.1 and remark that the result has appeared in earlier control-as-inference work as well, a Bayesian framing that recasts optimal control as posterior inference [src_034]. We will not reproduce the full proof here, because the result is what matters for what follows.
Two observations are worth making explicit before we move on. First, \(\pi_r\) depends on \(r\) only through the exponential, scaled by \(1/\beta\), which is why \(\beta\) behaves like a temperature in this story. As \(\beta \to 0\) the optimal policy concentrates on the argmax response and KL becomes irrelevant; as \(\beta \to \infty\) the optimal policy collapses back onto \(\pi_\text{ref}\). Second, \(Z(x)\) is a sum over all possible responses for the prompt \(x\). For language models with a vocabulary of \(V\) tokens and responses of length up to \(T\), that sum has \(V^T\) terms, and the partition function is intractable to compute directly [src_034]. This intractability is the reason classical RLHF resorts to PPO sampling rather than computing \(\pi_r\) closed-form: even with \(r_\phi\) in hand, you cannot evaluate \(\pi_r(y \mid x)\) for arbitrary \(y\).
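To make the temperature behaviour concrete, here is a small numerical sketch on an invented toy problem: a prompt with only five candidate responses, a hand-picked reference distribution, and hand-picked rewards, so that the partition function is trivially summable and \(\pi_r\) can be computed exactly. None of these numbers come from the paper; they exist only to show the two limits of \(\beta\).

import math

# Invented toy setup: 5 candidate responses, a reference distribution, and rewards.
pi_ref = [0.40, 0.30, 0.15, 0.10, 0.05]
reward = [0.0, 1.0, 2.0, 0.5, -1.0]

def kl_optimal_policy(pi_ref, reward, beta):
    """Closed-form KL-constrained optimum: pi_r(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
    Z = sum(unnorm)  # tractable here only because the toy response space has 5 elements
    return [u / Z for u in unnorm]

for beta in (0.1, 1.0, 10.0):
    print(beta, [round(p, 3) for p in kl_optimal_policy(pi_ref, reward, beta)])
# Small beta concentrates mass on the highest-reward response (index 2);
# large beta leaves the policy close to pi_ref.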
DPO's contribution is to show that you do not need to.
3. The implicit reward¶
🎯 Intuition
Think of policies and rewards as two faces of the same object. Section 2's closed form goes "reward → optimal policy": pick a reward, get \(\pi_r\). The next equation goes the other way, "optimal policy → reward": if you already know \(\pi_r\), you can read off \(r\) as a scaled log-ratio against the reference plus a prompt-only constant. The two directions are inverses of the same map. This is what licenses the slogan "your language model is secretly a reward model": the policy parameters \(\theta\) are themselves the parameters of an implicit reward.
Take the logarithm of both sides of the closed-form expression for \(\pi_r\) and rearrange:
\[
r(x, y) \;=\; \beta \log \frac{\pi_r(y \mid x)}{\pi_\text{ref}(y \mid x)} \;+\; \beta \log Z(x).
\]
This is the implicit-reward equation [src_034]. It says that whenever a reward function \(r\) admits the KL-constrained optimum \(\pi_r\), the reward can be reconstructed from the optimal policy as a scaled log-ratio against the reference plus a prompt-dependent constant. The intractable partition function has not gone anywhere (it is sitting inside \(\beta \log Z(x)\)), but it has been moved from a place where it gates inference to a place where, as we will see in the next section, it cancels.
🤔 Pause and reflect
Before reading on, predict: if you substitute this expression into a Bradley-Terry preference probability \(\sigma(r^*(x, y_w) - r^*(x, y_l))\), what does the prompt-only \(\beta \log Z(x)\) term contribute to the difference \(r^*(x, y_w) - r^*(x, y_l)\)? (Do not look ahead; write the answer down or say it out loud.)
Read this equation in two directions. The forward direction is the original closed form: pick a reward, get an optimal policy. The backward direction is what the paper exploits: pick a policy of the form \(\pi_\theta\), and treat the log-ratio against the reference as a candidate reward function, up to the additive prompt term. The backward direction is what makes the policy "secretly a reward model": the policy is not just being trained against an external reward; the policy parameters \(\theta\) are themselves the parameters of an implicit reward \(\hat{r}_\theta(x, y) = \beta \log( \pi_\theta(y \mid x) / \pi_\text{ref}(y \mid x))\) [src_034]. Section 5.1 of the paper proves that this reparameterization does not lose any reward classes that the Bradley-Terry preference model can identify; we cite the result and move on [src_034].
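The inverse relationship is easy to check numerically. Reusing the invented five-response toy setup from the previous snippet, the sketch below recovers the reward from the optimal policy as the log-ratio against the reference and verifies that it matches the original reward up to a single prompt-only constant:

import math

pi_ref = [0.40, 0.30, 0.15, 0.10, 0.05]
reward = [0.0, 1.0, 2.0, 0.5, -1.0]
beta = 1.0

# Forward direction (Section 2): reward -> optimal policy.
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
Z = sum(unnorm)
pi_r = [u / Z for u in unnorm]

# Backward direction (this section): optimal policy -> reward, up to beta * log Z(x).
recovered = [beta * math.log(p / q) for p, q in zip(pi_r, pi_ref)]
offset = reward[0] - recovered[0]  # exactly beta * log Z(x), identical for every response
assert all(abs(rec + offset - r) < 1e-9 for rec, r in zip(recovered, reward))
print("recovered reward differs from the original by the prompt-only constant", round(offset, 4))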
4. The Bradley-Terry cancellation¶
Chapter 11 introduced the Bradley-Terry preference model: given a prompt \(x\) and two completions \(y_1\) and \(y_2\), the probability that a human prefers \(y_1\) over \(y_2\) is
\[
p^*(y_1 \succ y_2 \mid x) \;=\; \sigma\big( r^*(x, y_1) - r^*(x, y_2) \big),
\]
where \(\sigma\) is the logistic function and \(r^*\) is the latent ground-truth reward [src_034]. The crucial structural property is that this probability depends on \(r^*\) only through the difference \(r^*(x, y_1) - r^*(x, y_2)\). Any term in \(r^*\) that does not depend on \(y\), including a prompt-only term like \(\beta \log Z(x)\), drops out of the difference and contributes nothing to the preference probability.
🔁 Connection
The Bradley-Terry model, a sigmoid over latent-utility differences in which the latent utilities are the ground-truth rewards on each response, was derived in Chapter 11 (From SFT to RLHF) as the standard model for pairwise human preferences. We now use its differences-only structure as the algebraic lever that kills \(\beta \log Z(x)\).
Substitute the implicit-reward expression from Section 3 into the Bradley-Terry model with \(y_1 = y_w\) (chosen) and \(y_2 = y_l\) (rejected):
\[
r^*(x, y_w) - r^*(x, y_l)
\;=\; \beta \log \frac{\pi^*(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} + \beta \log Z(x)
\;-\; \beta \log \frac{\pi^*(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} - \beta \log Z(x).
\]
The two \(\beta \log Z(x)\) terms cancel because both completions share the same prompt [src_034]. This is the moment that makes DPO RL-free. The intractable partition function never leaves the formula; we just observe that the contrastive structure of preference data is exactly the structure that kills it.
Plugging the cancelled difference back into the Bradley-Terry expression gives the preference probability under the optimal policy:
\[
p^*(y_w \succ y_l \mid x) \;=\; \sigma\!\left( \beta \log \frac{\pi^*(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} \right).
\]
This is Eq. 6 of the original paper [src_034]. The probability that a human prefers \(y_w\) over \(y_l\) is fully determined by the optimal policy and the reference; no scalar reward model and no partition function appear.
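A short numeric check makes the cancellation tangible. The sketch below uses invented log-probabilities for a single (prompt, chosen, rejected) triple and an arbitrary value for \(\log Z(x)\); it evaluates the preference probability once through the implicit reward with the partition term included and once from the policy log-ratios alone, and confirms the two routes agree.

import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
beta = 0.1

# Invented per-response log-probabilities for one preference triple.
logp_pi_w, logp_ref_w = -12.0, -14.0   # log pi*(y_w | x), log pi_ref(y_w | x)
logp_pi_l, logp_ref_l = -18.0, -15.0   # log pi*(y_l | x), log pi_ref(y_l | x)
log_Z = 3.7                            # arbitrary prompt-only constant

# Route 1: through the implicit reward, partition term included.
r_w = beta * (logp_pi_w - logp_ref_w) + beta * log_Z
r_l = beta * (logp_pi_l - logp_ref_l) + beta * log_Z
p_via_reward = sigmoid(r_w - r_l)

# Route 2: policy log-ratios only; log Z(x) never appears.
p_via_policy = sigmoid(beta * ((logp_pi_w - logp_ref_w) - (logp_pi_l - logp_ref_l)))

assert abs(p_via_reward - p_via_policy) < 1e-9
print(p_via_reward)  # same number either way: the prompt-only term cancelled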
💡 Key result
The contrastive structure of preference data (same prompt, two responses, take the difference) kills the intractable partition function \(\beta \log Z(x)\). This is what makes DPO RL-free.
5. The DPO loss¶
Given a preference dataset \(\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^N\) of triples (prompt, chosen response, rejected response) and an arbitrary parameterized policy \(\pi_\theta\), the DPO objective is the negative log-likelihood of the observed preferences under the expression above with \(\pi_\theta\) in place of \(\pi^*\):
\[
\mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref}) \;=\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} \right) \right].
\]
This is Eq. 7 of the paper, and it is the entire training objective [src_034]. It is binary cross-entropy on a single scalar, the implicit-reward gap \(\beta \log(\pi_\theta(y_w)/\pi_\text{ref}(y_w)) - \beta \log(\pi_\theta(y_l)/\pi_\text{ref}(y_l))\), under the label "the human said \(y_w\) won". There is no reward model, no value head, no on-policy sampling, and no PPO clipping. Training a policy with DPO requires only the same forward pass and the same data loader you would use for a standard SFT step, plus a frozen copy of the reference for log-ratio computation.
The gradient of \(\mathcal{L}_\text{DPO}\) with respect to \(\theta\) exposes the mechanism the paper calls "dynamic per-example weighting" [src_034]:
🎯 Intuition
Read the gradient as a mis-ranking detector. The bracketed sigmoid term is large when the implicit reward currently mis-ranks the pair β when the policy assigns higher reward to the rejected response than to the chosen one β and small when the policy already gets the ranking right. The gradient flows where it is needed (examples the model gets wrong) and ignores examples it has already learned. The equation that follows is the mathematical form of this detector.
\[
\nabla_\theta \mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref})
\;=\; -\,\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big) \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big) \Big],
\]
where \(\hat{r}_\theta(x, y) = \beta \log( \pi_\theta(y \mid x) / \pi_\text{ref}(y \mid x))\) is the implicit reward [src_034]. The bracketed sigmoid term is a per-example weight that is large when the implicit reward currently mis-ranks the pair (it gives high reward to the rejected response) and small when the policy already gets the ranking right. This weighting is what prevents the loss from degenerating into a naive maximum-likelihood-on-\(y_w\) objective; without it the model collapses, as the original paper documents in Appendix Table 3 [src_034].
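Two invented examples show the weight in action. In the sketch below, a pair the policy already ranks correctly by a wide margin receives a weight near zero and is effectively ignored, while a mis-ranked pair receives a weight near one:

import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

def dpo_example_weight(logratio_w, logratio_l, beta=0.1):
    """Per-example gradient weight sigma(r_hat_l - r_hat_w), with r_hat = beta * log-ratio."""
    return sigmoid(beta * logratio_l - beta * logratio_w)

# Already ranked correctly with a wide implicit-reward gap: almost no gradient weight.
print(dpo_example_weight(logratio_w=+30.0, logratio_l=-30.0))   # ~0.002

# Mis-ranked: the rejected response currently has the higher implicit reward.
print(dpo_example_weight(logratio_w=-10.0, logratio_l=+10.0))   # ~0.88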
💡 Key result
The DPO training objective is binary cross-entropy on a single scalar, the implicit-reward gap, with no reward model, no value head, no on-policy sampling, and no PPO clipping.
🔄 Recap
- Complete: starting from a reward \(r\), the KL-constrained optimum is \(\pi_r(y \mid x) \propto \pi_\text{ref}(y \mid x) \exp(\_\_\_)\).
- Explain: why does the prompt-only term \(\beta \log Z(x)\) cancel under a Bradley-Terry preference difference?
- Predict: for a pair \((x, y_w, y_l)\) that the policy already ranks correctly with a large implicit-reward gap, what value does the dynamic per-example weight \(\sigma(\hat r_\theta(x, y_l) - \hat r_\theta(x, y_w))\) approach, and what does that imply for the gradient on this example?
6. \(\beta\) as a temperature¶
Two practical knobs determine whether a DPO run produces a useful policy: the choice of reference \(\pi_\text{ref}\) and the choice of \(\beta\). The reference choice is usually settled by the pipeline: it is the SFT policy that produced the demonstrations, or, when no SFT policy is available, a maximum-likelihood fit to the chosen completions [src_034]. The \(\beta\) choice deserves its own paragraph.
Re-read the implicit-reward equation. The implicit reward scales linearly in \(\beta\), but so does the KL coefficient in the original RLHF objective; mechanically, \(\beta\) controls how aggressively the loss is allowed to push the policy's log-probability ratios against the reference away from zero. When \(\beta\) is too high, the implicit-reward gap is large for any small log-ratio difference, the sigmoid saturates almost immediately, the gradient on most examples vanishes, and the policy stays glued to the reference [src_005]. When \(\beta\) is too low, the sigmoid stays in its linear region even for very large log-ratios, the loss is happy to drive \(\log( \pi_\theta(y_w)/\pi_\text{ref}(y_w))\) to large positive values and \(\log(\pi_\theta(y_l)/\pi_\text{ref}(y_l))\) to large negative values without bound, and the policy drifts off the reference distribution into regions where neither the preference dataset nor the reference can speak to its behavior. The RLHF Book, which is the most recent open synthesis of community practice, reports typical \(\beta\) values in the range \(0.01\) to \(0.5\) for large-language-model fine-tuning, with \(0.1\) as a defensible first guess for new domains [src_005].
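The two regimes are easy to see numerically. The sketch below (illustrative numbers only, not a tuning recommendation) evaluates the per-example gradient weight from §5 for one fixed mis-ranked pair at several \(\beta\) values in and around the reported range:

import math

sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# One fixed, invented mis-ranking: the rejected log-ratio exceeds the chosen by 5 nats.
raw_gap = 5.0

for beta in (0.01, 0.1, 0.5, 5.0):
    weight = sigmoid(beta * raw_gap)  # gradient weight on this mis-ranked example
    print(f"beta={beta:<5} weight={weight:.3f}")
# beta=0.01 -> ~0.512: the sigmoid barely moves off 0.5, so every example is weighted alike
# beta=5.0  -> ~1.000: already saturated; as soon as the ranking flips, the weight collapses
#              toward zero and the policy stops moving away from the reference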
🤔 Pause and reflect
Compose the gradient mechanism from §5 with the implicit-reward equation \(\hat r_\theta = \beta \log(\pi_\theta / \pi_\text{ref})\). What happens to the per-example weight \(\sigma(\hat r_\theta(x, y_l) - \hat r_\theta(x, y_w))\) as \(\beta \to \infty\) for a policy that starts at \(\pi_\theta = \pi_\text{ref}\)? Predict the consequence for the policy's training trajectory before reading the next section. (Do not look ahead.)
⚠️ Pitfall
The two failure modes named here as theory show up in §10 as observable diagnostics: \(\beta\) too high manifests as the implicit-reward gap stuck near zero across training; \(\beta\) too low manifests as both log-ratios drifting in the same direction. Watch for the symptoms in §10 before retuning \(\beta\).
7. A PyTorch sketch¶
The DPO loss in code is short. The only subtlety is that \(\log \pi_\theta(y \mid x)\) for a language-model policy is a sum of token log-probabilities over the response tokens only: the prompt tokens are conditioning context and must be masked out of the sum, otherwise the loss penalises the policy for changing the probability of the user's prompt. Stanford CS336 covers the same masking pattern in its lectures on alignment training and supervised fine-tuning [src_004]. The following snippet is the loss only; data loading, the optimizer, and the frozen-reference forward pass are standard.
⚠️ Pitfall
If we did not mask the prompt tokens out of the loss, the summed log-probability would also include the prompt tokens' likelihood under the policy, a quantity we are not trying to learn, and the loss would penalise the policy for changes in the probability it assigns to the prompt. The response_mask in the snippet below is what enforces the response-only sum.
import torch
import torch.nn.functional as F

def response_logprob(logits, labels, response_mask):
    """Sum log-probabilities over response tokens only.

    logits:        (B, T, V) policy logits, shifted so logits[:, t] predicts labels[:, t].
    labels:        (B, T) token ids for the full sequence (prompt + response).
    response_mask: (B, T) 1 where the token is part of the response, 0 on prompt + padding.
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * response_mask).sum(dim=-1)

def dpo_loss(pi_logits_w, pi_logits_l,
             ref_logits_w, ref_logits_l,
             labels_w, labels_l, mask_w, mask_l, beta=0.1):
    pi_lp_w = response_logprob(pi_logits_w, labels_w, mask_w)
    pi_lp_l = response_logprob(pi_logits_l, labels_l, mask_l)
    ref_lp_w = response_logprob(ref_logits_w, labels_w, mask_w)
    ref_lp_l = response_logprob(ref_logits_l, labels_l, mask_l)
    chosen_logratio = pi_lp_w - ref_lp_w
    rejected_logratio = pi_lp_l - ref_lp_l
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
The reference forward passes happen under torch.no_grad() because \(\pi_\text{ref}\) is frozen. Most production implementations cache the reference log-probabilities up front, since the reference does not change during training; this turns DPO into a single-model fine-tuning loop with a precomputed table of per-example reference log-probabilities, which is one of the reasons DPO is so much cheaper to run than PPO-RLHF [src_005].
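For completeness, one possible shape of a single training step is sketched below. The model calls assume a HuggingFace-style forward pass that returns an object with a .logits field, the batch keys are placeholder names, and the label shifting described above is assumed to have been done by the collator; treat this as an outline, not a drop-in loop.

def dpo_step(policy, reference, batch, optimizer, beta=0.1):
    """One DPO update; placeholder batch keys, shifting and padding handled by the collator."""
    with torch.no_grad():  # the reference is frozen, so its forward pass carries no gradient
        ref_logits_w = reference(batch["input_ids_w"]).logits
        ref_logits_l = reference(batch["input_ids_l"]).logits
    pi_logits_w = policy(batch["input_ids_w"]).logits
    pi_logits_l = policy(batch["input_ids_l"]).logits
    loss = dpo_loss(pi_logits_w, pi_logits_l, ref_logits_w, ref_logits_l,
                    batch["labels_w"], batch["labels_l"],
                    batch["mask_w"], batch["mask_l"], beta=beta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()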
8. The DPO family¶
Within two years of the original paper, the community produced a small zoo of algorithms that keep DPO's "no reward model, no on-policy sampling" promise while fixing one or another of its shortcomings. The four most-cited variants as of April 2026 are summarised below, with primary citations alongside Lambert's RLHF Book as the consensus framing reference [src_005].
IPO (Identity Preference Optimization, Azar et al. 2023) derives a general \(\Psi\)PO objective for preference learning, a family of preference losses parameterised by an arbitrary non-decreasing function \(\Psi\) applied to the preference probability, and instantiates the Identity special case, which replaces DPO's log-sigmoid surrogate with a squared-error term on the implicit-reward gap, regressing \(\beta \big( \log( \pi_\theta(y_w)/\pi_\text{ref}(y_w)) - \log( \pi_\theta(y_l)/\pi_\text{ref}(y_l)) \big)\) toward a fixed margin rather than toward \(+\infty\) [src_052, src_005]. The motivation is robustness when preferences are nearly deterministic in the dataset: under DPO's logistic loss, a deterministic preference pulls the implicit-reward gap toward infinity, which can be interpreted as the optimizer trying to drive \(\pi_\theta\) infinitely far from \(\pi_\text{ref}\). IPO's bounded target prevents this overfitting [src_052, src_005].
KTO (Kahneman-Tversky Optimization, Ethayarajh et al. 2024) drops the pairwise structure entirely and trains on unpaired examples labeled as "good" or "bad" [src_053, src_005]. The loss is a prospect-theory-inspired utility function over the implicit reward (the asymmetric, loss-averse utility from behavioural economics that gives the K and T in the acronym), treating gains and losses asymmetrically, so the model can learn from feedback datasets that consist of scattered thumbs-up and thumbs-down rather than chosen-rejected pairs [src_053, src_005]. This matters in production because most user feedback is unpaired.
ORPO (Odds Ratio Preference Optimization, Hong et al. 2024) folds preference learning into supervised fine-tuning in a single stage, augmenting the SFT cross-entropy on \(y_w\) with an odds-ratio penalty against \(y_l\): it penalises the log-ratio of chosen-vs-rejected odds, \(\log(\pi_\theta(y_w)/(1 - \pi_\theta(y_w))) - \log(\pi_\theta(y_l)/(1 - \pi_\theta(y_l)))\), the same odds-ratio object that appears in logistic regression [src_054, src_005]. ORPO does not need a separate reference policy at all; the training run starts from a base or instruction model and produces an aligned policy directly [src_054]. The savings are substantial when the SFT and DPO phases would otherwise each require a full pass over the training data.
SimPO (Simple Preference Optimization, Meng et al. 2024) drops the reference policy from the loss and uses length-normalized average log-probability as the implicit reward, \(\hat{r}_\theta(x, y) = (\beta / |y|) \sum_t \log \pi_\theta(y_t \mid x, y_{<t})\), paired with a target reward margin in the Bradley-Terry objective [src_055, src_005]. Removing the reference removes the memory cost of a second forward pass; length normalization addresses one of the most-reported failure modes of vanilla DPO, namely a tendency to inflate response length because longer responses accumulate more log-probability difference against the reference [src_055, src_005].
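To make the differences in loss shape concrete, here is a minimal sketch of the IPO and SimPO losses written against the same summed response log-probabilities as the §7 snippet. These follow the published formulas as described above but are not the authors' reference implementations, and the tau, beta, and gamma defaults are illustrative placeholders rather than recommendations:

import torch.nn.functional as F

def ipo_loss(pi_lp_w, pi_lp_l, ref_lp_w, ref_lp_l, tau=0.1):
    """IPO: regress the log-ratio gap toward the fixed target 1/(2*tau) instead of toward infinity."""
    gap = (pi_lp_w - ref_lp_w) - (pi_lp_l - ref_lp_l)
    return ((gap - 1.0 / (2.0 * tau)) ** 2).mean()

def simpo_loss(pi_lp_w, pi_lp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """SimPO: reference-free, length-normalized implicit reward with a target margin gamma.

    len_w / len_l are response token counts, e.g. mask_w.sum(dim=-1) and mask_l.sum(dim=-1).
    """
    reward_w = beta * pi_lp_w / len_w  # average per-token log-probability of the chosen response
    reward_l = beta * pi_lp_l / len_l
    return -F.logsigmoid(reward_w - reward_l - gamma).mean()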
Where the variants agree and disagree¶
The four variants disagree on the shape of the loss, on whether a reference policy is needed, on whether feedback is paired or unpaired, and on whether length is normalized. They agree on the underlying claim: the contrastive log-likelihood of preferences against an implicit reward is enough signal to align a language model, and the PPO machinery is not needed.
9. When to use DPO and when not to¶
DPO has become the default open-weights post-training algorithm in 2025-2026 because it is faster, more stable, and easier to tune than PPO-RLHF [src_005]. A DPO run requires only a frozen reference, a preference dataset, and an SFT-style training loop; a PPO-RLHF run requires a reward model that has to be trained and validated separately, an online sampling pipeline, a value head, and an adaptive KL controller that has to be tuned alongside everything else. For practitioners who have a clean preference dataset and a reasonable SFT model, DPO simply costs less to get to a working aligned policy [src_005].
PPO-style methods retain two clear advantages. The first is that on-policy sampling lets the algorithm see what its current policy actually generates, rather than what the dataset says some prior policy generated; this matters when the training distribution has drifted away from the policy under optimization, as it does after several rounds of iterative refinement. The second is that a strong reward signal, particularly a verifiable reward that is programmatically computed from the response rather than learned from preferences, gives PPO and its descendants a precision DPO cannot match, because DPO is fundamentally a preference-learning algorithm and cannot exploit a reward signal that has more structure than "\(y_w\) beats \(y_l\)" [src_005]. Chapter 13 picks up exactly this thread: when rewards are programmatic (math problems with checkable final answers, code with unit tests), Group Relative Policy Optimization with verifiable rewards (GRPO + RLVR) outperforms preference learning on the tasks it can express, and DPO's contrastive structure becomes the wrong tool [src_005, src_002].
The Xiao and Zhu monograph frames the same trade-off and notes that DPO and PPO are best understood as different points on the curve between data efficiency and signal strength, not as competitors for a single throne [src_002]. Most open-weights model releases in 2025-2026 ship with DPO or one of its variants as their final post-training stage; releases that instead ship with PPO or GRPO are typically reasoning-focused and have a verifiable reward to point at [src_005].
10. Diagnostics¶
Three quantities are worth logging during a DPO run [src_005]. The first is the implicit-reward gap, \(\beta \big(\log(\pi_\theta(y_w)/\pi_\text{ref}(y_w)) - \log(\pi_\theta(y_l)/\pi_\text{ref}(y_l))\big)\), averaged over the batch. A successful run has this quantity rising over training: the policy is learning to assign higher implicit reward to chosen than to rejected. A run where the gap stays near zero is a run where DPO is not learning anything, usually because \(\beta\) is too high and the sigmoid is saturating [src_005].
The second is the chosen and rejected log-ratios separately: \(\log(\pi_\theta(y_w)/\pi_\text{ref}(y_w))\) and \(\log(\pi_\theta(y_l)/\pi_\text{ref}(y_l))\). Both of these are zero at initialization (because \(\pi_\theta = \pi_\text{ref}\) at the start of training) and drift apart as training progresses. A pathological run has both terms drifting in the same direction (both becoming more negative, say) even though the gap between them is increasing; the policy is collapsing on both chosen and rejected responses, just collapsing slightly faster on rejected. This is the early symptom of \(\beta\) being too low; the eventual outcome is a policy that has drifted far from the reference and produces degenerate text [src_005].
The third is response length on a held-out prompt set. DPO has a documented length bias: because longer responses accumulate more log-probability difference against the reference, the gradient implicitly prefers longer chosen responses when the dataset itself does not control for length [src_005]. If response length on validation prompts grows monotonically with training, length is doing some of the work the preference signal was supposed to do, and the resulting policy will look good on the preference benchmark and bad in the wild. SimPO addresses this directly with length normalization; vanilla DPO addresses it indirectly via length-controlled preference datasets and early stopping [src_005].
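A minimal logging helper for the first two quantities, reusing the summed response log-probabilities from the §7 snippet, might look like the sketch below; the dictionary keys are placeholder names for whatever experiment tracker the run already uses, and response length on held-out prompts is tracked separately at generation time:

def dpo_batch_diagnostics(pi_lp_w, pi_lp_l, ref_lp_w, ref_lp_l, beta=0.1):
    """Batch-mean DPO diagnostics from summed response log-probabilities."""
    chosen_logratio = pi_lp_w - ref_lp_w      # ~0 at initialization
    rejected_logratio = pi_lp_l - ref_lp_l    # ~0 at initialization, should fall behind chosen
    return {
        # Should rise over training; stuck near zero suggests beta is too high.
        "implicit_reward_gap": (beta * (chosen_logratio - rejected_logratio)).mean().item(),
        # Both drifting the same way while the gap grows suggests beta is too low.
        "chosen_logratio": chosen_logratio.mean().item(),
        "rejected_logratio": rejected_logratio.mean().item(),
    }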
11. Where this leads¶
DPO removed the reward model and the RL loop from preference-based alignment by exploiting a single algebraic fact: the partition function of the KL-constrained optimum is prompt-only, and prompt-only terms cancel in contrastive preference probabilities. The resulting algorithm is a one-line loss that has displaced PPO as the default post-training method for open-weights language models in the current era.
The approach has a structural limit, though, and that limit is the topic of Chapter 13. DPO is a preference-learning algorithm: it needs paired (chosen, rejected) data, and the only signal it can extract is "\(y_w\) won". When the reward for a response is not a preference but a programmatically checkable quantity (the boxed answer on a math problem, the pass rate on a unit-test suite, the validity of a structured output), preference learning throws away most of the signal. The reasoning models that emerged in 2025 (DeepSeek-R1 and its successors) replace DPO at the post-training stage with a different algorithm, GRPO, fed by verifiable rewards rather than preferences. Chapter 13 derives that algorithm and argues for when it is the right choice.
🔁 Connection
Chapter 13 (Reasoning models) derives GRPO + RLVR for verifiable-reward settings, where DPO's contrastive structure is the wrong tool. Together with Chapter 11, the alignment trilogy is: PPO-RLHF (the historical baseline), DPO (this chapter, the open-weights default), GRPO (the verifiable-reward successor).
🔄 Recap
- Complete: the implicit reward under DPO is \(\hat r_\theta(x, y) = \_\_\_\).
- Explain: in your own words, what is the Bradley-Terry cancellation and why does it make DPO RL-free?
- Compare: vanilla DPO and SimPO. Which mechanism does length normalisation fix, and which one does dropping the reference policy fix?
- Predict: a held-out validation set shows response length growing monotonically with training while preference-benchmark scores also rise; which DPO failure mode does this signal, and which §8 variant addresses it most directly?
References¶
- src_002: Xiao, T. and Zhu, J. (2025). Foundations of Large Language Models (arXiv:2501.09223v2). https://arxiv.org/pdf/2501.09223
- src_004: Hashimoto, T. and Liang, P. (2025). Stanford CS336: Language Modeling from Scratch (Spring 2025). https://stanford-cs336.github.io/spring2025/
- src_005: Lambert, N. (2026). Reinforcement Learning from Human Feedback (RLHF Book, v8). https://rlhfbook.com/
- src_034: Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model (arXiv:2305.18290). NeurIPS 2023. https://arxiv.org/pdf/2305.18290
- src_052: Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., and Munos, R. (2023). A General Theoretical Paradigm to Understand Learning from Human Preferences (arXiv:2310.12036, AISTATS 2024). https://arxiv.org/pdf/2310.12036
- src_053: Ethayarajh, K., Xu, W., Muennighoff, N., Jurafsky, D., and Kiela, D. (2024). KTO: Model Alignment as Prospect Theoretic Optimization (arXiv:2402.01306, ICML 2024). https://arxiv.org/pdf/2402.01306
- src_054: Hong, J., Lee, N., and Thorne, J. (2024). ORPO: Monolithic Preference Optimization without Reference Model (arXiv:2403.07691, EMNLP 2024). https://arxiv.org/pdf/2403.07691
- src_055: Meng, Y., Xia, M., and Chen, D. (2024). SimPO: Simple Preference Optimization with a Reference-Free Reward (arXiv:2405.14734, NeurIPS 2024). https://arxiv.org/pdf/2405.14734