
Scaling Laws

Frontier pretraining is a one-shot decision under a fixed compute budget. Once the cluster is allocated and the data pipeline is locked, the team commits to a model size \(N\) and a training token count \(D\) and runs the schedule end-to-end. There is no opportunity to discover, after the fact, that \(N\) should have been smaller and \(D\) larger: the FLOPs are spent. Anyone who has watched a multi-million-dollar training run knows the consequence. The decision must be made on the basis of small-scale extrapolations β€” runs costing a fraction of the flagship β€” that predict where the loss-versus-compute curve will be at the target scale [src_028, src_030].

Scaling laws are the empirical claim that this extrapolation works. The cross-entropy loss of a Transformer language model has a remarkably smooth dependence on the three quantities a team controls: parameter count \(N\), training token count \(D\), and compute \(C\). Plotted in log-log coordinates, the dependence is approximately linear over many orders of magnitude. That linearity is why frontier teams can fit a curve on a sweep of small models and bet a year of cluster time on its extrapolation [src_027, src_028, src_030].

This chapter walks through how the field arrived at its current understanding of those laws. The story has three acts. First, the original Kaplan et al. (2020) formulation [src_027], which made scaling laws a quantitative discipline but pointed teams in a direction that turned out to be wrong. Second, the Hoffmann et al. (2022) Chinchilla correction [src_028], which fixed a methodological bug and reset the optimal \(N/D\) ratio. Third, the post-Chinchilla over-training regime, in which Llama-3 [src_030] and similar frontier models deliberately train far past the Chinchilla compute optimum because the dominant lifetime cost is inference, not training. Each act tightens the question of what we are actually optimizing. MoE-specific scaling, RL-from-reward scaling, and test-time-compute scaling are out of scope here; they reappear in the closing section as the bridges to subsequent chapters.

The premise: power laws in \(N\), \(D\), \(C\)

Define the three primary scaling variables as Kaplan et al. used them [src_027]. \(N\) is the number of non-embedding parameters in the Transformer (embedding parameters scale with vocabulary, which the original paper held roughly fixed). \(D\) is the number of training tokens consumed in a single epoch (no token is seen twice). \(C\) is total training compute, measured in floating-point operations.

🎯 Intuition

A power law looks like a straight line on log-log axes. Take the log of both sides of \(L = A N^{-\alpha}\) and the relationship becomes \(\log L = -\alpha \log N + \text{const}\) β€” a line with slope \(-\alpha\). Every claim in this chapter about exponents reads off such a line.

The empirical observation is that test cross-entropy loss \(L\) obeys a power law in each of these quantities, provided the other two are not bottlenecks. Holding \(D\) effectively infinite and increasing \(N\), the loss falls as \(N^{-\alpha_N}\). Holding \(N\) effectively infinite and increasing \(D\), the loss falls as \(D^{-\alpha_D}\). Plotted on log-log axes, each curve is approximately a straight line over more than seven orders of magnitude in the underlying variable [src_027].
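To make "reading an exponent off a line" concrete, here is a minimal sketch β€” synthetic data, Kaplan-like constants, nothing from a real run β€” of the log-log regression that produces every exponent quoted in this chapter:

```python
# Minimal sketch: recover a power-law exponent by fitting a line in log-log
# space. Losses are synthetic, generated from L = (N_c / N)^0.076 with a
# Kaplan-like N_c plus noise, so the fit should return roughly -0.076.
import numpy as np

rng = np.random.default_rng(0)
N = np.logspace(6, 10, 20)                            # 1M to 10B parameters
loss = (8.8e13 / N) ** 0.076 * np.exp(rng.normal(0, 0.01, N.size))

slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
print(f"fitted exponent: {slope:.3f}")                # ~ -0.076
```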

Two further observations tighten this into a working tool for budget allocation. First, performance depends only weakly on architectural shape β€” depth, width, and head count β€” once \(N\) is fixed. The shape choices that consume an architect's afternoon barely move the loss curve, while doubling \(N\) moves it predictably [src_027]. Second, training compute is well approximated by the deceptively simple expression

\[ C \approx 6 \, N \, D \]

where the factor of 6 absorbs the forward pass (roughly \(2ND\) for the dominant matrix multiplications) and the backward pass (roughly \(4ND\), twice the forward because gradients flow through both inputs and weights); the precise multiplier depends on architectural details such as whether the FFN uses a two-matrix or three-matrix gated form like SwiGLU [src_027, src_004].
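In code, the rule is a one-liner; here is a sketch with the forward/backward split made explicit (the split is the approximation above, not an exact count):

```python
# Back-of-envelope training-FLOPs estimate from the C ~ 6*N*D rule.
# The 2/4 split is the approximation from the text; real counts shift
# with attention FLOPs and FFN shape.
def train_flops(n_params: float, n_tokens: float) -> float:
    forward = 2 * n_params * n_tokens    # dominant matmuls, forward pass
    backward = 4 * n_params * n_tokens   # gradients w.r.t. inputs and weights
    return forward + backward            # = 6 * N * D

# Example: a Chinchilla-like run, 70B params on 1.4T tokens -> ~5.9e23 FLOPs
print(f"{train_flops(70e9, 1.4e12):.2e}")
```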

πŸ”— Connection

Architectural shape β€” depth, width, number of heads β€” was the active design topic of Chapter 8. The scaling-laws perspective takes those choices as approximately fungible at fixed parameter count.

🎯 Intuition

"Iso-FLOP" means "fixed compute budget". On linear axes, the curve \(ND = \text{const}\) is a hyperbola; on log-log axes, it is a line of slope \(-1\). The whole Kaplan-vs-Chinchilla story is geometrically: which point do you pick on which iso-FLOP line.

The point is that compute is approximately bilinear in \(N\) and \(D\), which gives the budget-allocation problem a clean geometric structure: the iso-FLOP curves are hyperbolas \(ND = \text{const}\).

The Kaplan formulation and its allocation prescription

Kaplan et al. (2020) fit specific functional forms to their data. With non-embedding parameters \(N\) in the millions-to-billions range and tokens \(D\) in the billions, they reported

\[ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076 \]
\[ L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095 \]

with a similar power law in compute along the compute-efficient frontier, \(L(C_\text{min}) \propto C_\text{min}^{-0.050}\) [src_027]. The \(N_c\) and \(D_c\) constants set the units; what matters for allocation is the ratio of exponents.

Kaplan's central allocation claim followed mechanically from those exponents. Under a fixed compute budget \(C \approx 6ND\), choose \(N\) and \(D\) to minimize the joint loss surface. Kaplan's fit implied that the dataset should grow only sublinearly with the model along the compute-optimal frontier β€” \(D \propto N^{0.27/0.73} \approx N^{0.37}\) β€” equivalently the compute exponents

\[ N \propto C^{0.73}, \quad D \propto C^{0.27} \]

πŸ€” Pause and reflect

Before reading on: given Kaplan's \(D \propto N^{0.37}\) along the compute-optimal frontier, do you predict that doubling compute should mostly grow the model, mostly grow the dataset, or split evenly? Argue from the exponent. (Don't peek β€” say the answer out loud or write it down.)

In words: most of any new compute should go into making the model larger, with only a sublinear contribution going into more data [src_027]. The slogan that escaped the paper into practitioner culture was that model size dominates. The 2020-2021 generation of large dense LLMs β€” GPT-3 (175B), Jurassic-1 (178B), Gopher (280B), and Megatron-Turing NLG (530B) β€” were all trained on roughly 300 billion tokens: parameter counts climbed from 175B to 530B while \(D\) barely moved [src_028]. The Kaplan recommendation, taken at face value, justified that pattern.
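To feel how lopsided the prescription is, raise 2 to each compute exponent β€” the growth factor per doubling of compute:

```python
# Growth factors per compute doubling under Kaplan's exponents:
# N and D scale as C^0.73 and C^0.27, so doubling C multiplies them
# by 2**0.73 and 2**0.27 respectively.
print(f"params: x{2**0.73:.2f}, tokens: x{2**0.27:.2f}")
# -> params: x1.66, tokens: x1.21 -- most new compute grows the model
```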

The Chinchilla correction

Hoffmann et al. (2022) re-ran the experiment at much greater density and reached the opposite conclusion [src_028]. They trained more than 400 language models ranging from 70 million to over 16 billion parameters, on training token counts from 5 billion to over 500 billion, and analyzed the loss-versus-FLOPs surface using three independent methodological approaches: an iso-FLOP curve sweep, a fit of training-curve envelopes, and a parametric loss form. All three approaches agreed [src_028].

The third approach, the parametric loss, is worth writing out because it gives the result in closed form. Hoffmann et al. proposed

🎯 Intuition

Read the three additive terms before the symbols arrive. \(E\) is the entropy floor β€” the best loss any model could in principle reach. \(A/N^{\alpha}\) is the penalty for having only finitely many parameters; \(B/D^{\beta}\) is the penalty for training on only finitely many tokens. Both penalties decay with their own power law toward \(E\).

\[ L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} \]

The constant \(E\) is the irreducible loss β€” the entropy of natural language as seen by an idealized infinite-capacity, infinite-data model. The two power-law terms capture, respectively, the cost of finite parameter capacity and the cost of finite training data. To find the compute-optimal allocation, minimize \(L\) subject to the constraint \(6ND = C\). Setting up the Lagrangian \(\mathcal{L}(N, D, \lambda) = E + A N^{-\alpha} + B D^{-\beta} - \lambda (6 N D - C)\) and taking the partial derivatives, the first-order condition for \(N\) is

\[ -\alpha A N^{-\alpha - 1} - 6 \lambda D = 0, \]

and the analogous condition for \(D\) is \(-\beta B D^{-\beta - 1} - 6 \lambda N = 0\). Dividing the two conditions to eliminate \(\lambda\) and substituting the FLOP constraint \(C = 6ND\) yields the closed-form optimum

\[ N^*(C) \propto C^{\beta / (\alpha + \beta)}, \quad D^*(C) \propto C^{\alpha / (\alpha + \beta)} \]

Hoffmann et al.'s fit gave \(\alpha \approx 0.34\) and \(\beta \approx 0.28\) β€” close enough to equal that the optimal allocation is approximately symmetric [src_028]. Substituting \(\alpha = \beta\) into the exponents gives \(\beta/(\alpha+\beta) = \alpha/(\alpha+\beta) = 1/2\), so the optimal \(N\) and \(D\) both scale as roughly \(C^{1/2}\) β€” the prediction is "split compute equally between parameters and tokens", or, with the constraint \(C = 6ND\), "tokens per parameter is constant". Compare this to Kaplan's \(C^{0.73}\) for \(N\): the corrected exponent is dramatically smaller, and the corresponding \(D\) exponent is correspondingly larger.
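The closed form is easy to sanity-check numerically: minimize the parametric loss along a single iso-FLOP line and compare with the formula. The sketch below uses Hoffmann et al.'s published fitted constants (\(E \approx 1.69\), \(A \approx 406.4\), \(B \approx 410.7\)) purely as an illustration:

```python
# Numerical sanity check of the closed-form optimum: minimize the parametric
# loss along one iso-FLOP line, compare with N* = G * (C/6)^(beta/(alpha+beta)).
import numpy as np
from scipy.optimize import minimize_scalar

E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss_on_isoflop(log_n: float, C: float) -> float:
    n = np.exp(log_n)
    d = C / (6 * n)                      # stay on the iso-FLOP line
    return E + A / n**alpha + B / d**beta

C = 1e23
res = minimize_scalar(loss_on_isoflop, bounds=(np.log(1e8), np.log(1e13)),
                      args=(C,), method="bounded")

G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))  # from dividing the FOCs
n_closed = G * (C / 6) ** (beta / (alpha + beta))
print(f"numerical N*: {np.exp(res.x):.3e}, closed form: {n_closed:.3e}")
```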

πŸ€” Pause and reflect

Substitute \(\alpha = \beta\) into the closed-form result \(N^*(C) \propto C^{\beta/(\alpha+\beta)}\) on paper. What exponent of \(C\) does \(N^*\) scale with? What does that imply for the ratio \(D^*/N^*\) at any fixed compute? (Resolve this before continuing.)

A practitioner's shorthand emerged from the Chinchilla paper's Table 3, which projected optimal token counts for various model sizes. A 1B-parameter model should see roughly 20.2B tokens; a 10B model, roughly 205B tokens; a 67B model, roughly 1.5T tokens [src_028]. Across rows, the ratio \(D^*/N^*\) stays close to 20. The slogan that survived contact with practice is "twenty tokens per parameter at the Chinchilla optimum" β€” a derived heuristic, not a stated equation in the paper, but a useful one for napkin calculations.
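The napkin calculation itself takes two lines: substitute \(D = 20N\) into \(C = 6ND\) and solve for \(N\). A sketch:

```python
# Napkin calculator: with D = 20*N, the budget C = 6*N*D = 120*N^2,
# so N = sqrt(C / 120). Chinchilla-optimal shorthand only.
def chinchilla_napkin(flops: float) -> tuple[float, float]:
    n = (flops / 120) ** 0.5
    return n, 20 * n

for c in (1e21, 1e22, 1e23):
    n, d = chinchilla_napkin(c)
    print(f"C = {c:.0e}: N ~ {n / 1e9:.1f}B params, D ~ {d / 1e9:.0f}B tokens")
# C = 1e22 lands near N ~ 9B, D ~ 183B -- consistent with the table above
```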

The empirical validation of Chinchilla's prediction was a head-to-head bake-off. The paper trained a 70B-parameter model on 1.4T tokens β€” same FLOP budget as Gopher (280B params, 300B tokens), but reallocated between \(N\) and \(D\) to land near the corrected optimum. The 70B Chinchilla outperformed the 280B Gopher across MMLU, BIG-bench, common-sense reasoning, reading comprehension, and language modelling [src_028]. The lesson was unambiguous: the field had been training models that were too big and too data-starved.

πŸ”— Connection

The choice of benchmark metric is itself a load-bearing assumption β€” see Chapter 13 on why aggregate metrics like MMLU can mask the very capabilities they claim to measure.

πŸ’‘ Key result

At a fixed FLOP budget, the Chinchilla compute-optimal allocation outperforms the Kaplan allocation across major language benchmarks.

πŸ”„ Recap

  • Complete. The parametric loss form posits that loss decomposes as \(L(N, D) = \_\_\_ + \_\_\_ + \_\_\_\). Name each term in plain words.
  • Explain. Why does the symmetry \(\alpha \approx \beta\) imply that compute should be split evenly between parameters and tokens?
  • Predict. A practitioner has \(C = 10^{22}\) FLOPs to spend. Using the 20-tokens-per-parameter shorthand, what (approximate) \(N\) and \(D\) should they target?
  • Compare. What was Kaplan's headline allocation prescription, and how did Chinchilla's bake-off against Gopher show it to be wrong?

Why Kaplan was wrong

It is worth being explicit about the methodological reason Kaplan's curve over-weighted \(N\), because the failure mode is instructive. Hoffmann et al. trace it to the learning-rate schedule [src_028, src_004]. The mechanism rests on the cosine decay schedule β€” a schedule that ramps up from zero, then decays the learning rate following half a cosine over a chosen training horizon, ending near zero. Kaplan held that schedule fixed across all model sizes, decaying from a maximum to a small minimum over a horizon set once, and read losses at different scales off intermediate points along the fixed schedule.

The problem is that the cosine schedule is well-tuned only when its decay horizon matches the actual number of training tokens. A loss read mid-schedule β€” before the learning rate has decayed β€” is worse than the loss the same model would post under a schedule set to end at exactly that token count. In Kaplan's protocol, the small-compute readings sit early on the long fixed schedule and pay this penalty in full, understating how good the small runs could have been at their own optimal stopping points; the readings near the end of the horizon are properly decayed and pay almost none of it.

How the mismatch biases the fitted exponents

The systematic effect is to make small models look worse than they are and large models look better. The fitted slope on \(N\) is too steep; the fitted slope on \(D\) is too shallow. The conclusion that \(N\) should grow much faster than \(D\) falls out of the bias [src_028].

Chinchilla fixed this by matching the cosine schedule to each run's actual token count. They also extended their model sweep up to 16B parameters, where Kaplan's was dominated by sub-100M models β€” a wider lever arm in the regression that exposed curvature Kaplan's straight-line fit could not see [src_028].

This is not a failure of intent. Kaplan et al. did careful empirical work and reported their methodology honestly. The bug is a classic methodology trap: a hidden hyperparameter (the LR schedule) interacted with the experimental sweep (over \(N\)) in a way that biased the headline result. The cost to the field was approximately two years of misallocated compute: models like GPT-3, Gopher, and MT-NLG were all built under the Kaplan prescription and were, by Chinchilla's measurement, substantially undertrained. The lesson is generic. Any scaling study fits a low-dimensional curve through a high-dimensional configuration space; if a non-swept hyperparameter is set wrongly relative to the sweep, the curve can lie convincingly.

⚠️ Pitfall

Any scaling study fits a low-dimensional curve through a high-dimensional configuration space. An unswept hyperparameter β€” a learning-rate schedule, a context length, a tokenizer β€” can quietly bias the headline exponents in either direction.

πŸ”„ Recap

  • Explain. In your own words: why did Kaplan's reuse of a single learning-rate schedule across model sizes systematically bias the fitted exponents in favour of growing \(N\)?
  • Predict. If a hypothetical scaling study had instead used a learning-rate schedule that was systematically too short for small models, in which direction would the headline exponents have been biased?
  • Compare. What did Chinchilla change in its experimental protocol, relative to Kaplan, that removed this bias?

Post-Chinchilla: the over-training regime

Recall that "compute-optimal" as used in Β§3 means: minimises pretraining loss at fixed training FLOPs β€” say so explicitly, because Β§6 will optimise a different objective. If Chinchilla settled the compute-optimal allocation question, why does Llama-3 8B train on roughly 15 trillion tokens β€” a \(D/N\) ratio of approximately 1875 tokens per parameter, nearly two orders of magnitude past the Chinchilla optimum of 20 [src_030]?

The Llama-3 paper is explicit about the rationale. Smaller models are trained for much longer than is compute-optimal so that they perform better than compute-optimal models at the same inference budget [src_030]. The key phrase is "same inference budget." Chinchilla's optimization minimized pretraining loss at fixed training compute. That is a useful objective if training is the only cost. It is the wrong objective if the model will be served to many users for a long time after training, because the dominant lifetime compute is then inference, not training.

The argument can be sketched arithmetically. Training compute is approximately \(C_\text{train} = 6ND\). Per-token inference compute (the forward pass only, no backward) is approximately \(C_\text{infer} \approx 2N\). Over a deployment lifetime of \(T_\text{lifetime}\) tokens served, total inference compute is roughly \(2N \cdot T_\text{lifetime}\). The total deployed compute cost of a model is

\[ C_\text{total} \approx 6ND + 2 N T_\text{lifetime} \]

πŸ€” Pause and reflect

Look at \(C_\text{total} \approx 6ND + 2NT_\text{lifetime}\). For what value of \(T_\text{lifetime}\) (in units of \(D\)) do the two terms balance? Predict before reading the next sentence.

🎯 Intuition

Training cost is bilinear in \(N\) and \(D\) and is paid once. Inference cost is linear in \(N\) alone and is paid forever. That asymmetry β€” bilinear-but-finite versus linear-but-perpetual β€” is why the modern frontier shrinks \(N\) and grows \(D\) past the Chinchilla optimum.

The two terms compare cleanly: \(2N T_\text{lifetime} > 6ND\) is equivalent (dividing both sides by \(2N\)) to \(T_\text{lifetime} > 3D\), so inference dominates lifetime cost once a model has been served more than three times the tokens it was trained on. For a model deployed at the scale of a frontier API, the per-day inference token volume is large; over a year or two of serving, \(T_\text{lifetime}\) easily reaches the trillions, and the crossover is passed early in any consumer deployment. The Ultra-Scale Playbook frames the same trade-off from the engineering side, situating the over-training choice within the broader serving-cost picture that frontier teams optimize against [src_007].
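A sketch of that arithmetic for a Llama-3-8B-like configuration (the served-token volumes are illustrative assumptions, not reported figures):

```python
# Lifetime-compute sketch: training is paid once, inference per served token.
N, D = 8e9, 15e12                       # params, training tokens

for served in (1e12, 45e12, 500e12):    # lifetime tokens served (assumed)
    train = 6 * N * D                   # paid once
    infer = 2 * N * served              # paid per served token
    share = infer / (train + infer)
    print(f"served {served:.0e}: inference share of lifetime FLOPs = {share:.0%}")
# The terms balance at served = 3*D = 45e12; beyond that, inference dominates.
```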

Allocation under inference-dominated cost

Once inference dominates, the optimization problem changes. Holding total cost approximately fixed and pushing \(N\) smaller while pushing \(D\) larger exchanges expensive forever-inference for cheap one-time training. The trade is favorable as long as the loss penalty from deviating from the Chinchilla optimum β€” a smaller model over-trained on more tokens β€” is smaller than the inference savings. Llama-3 8B and 70B both sit deep in this regime; the 405B flagship is trained closer to compute-optimal because at 405B parameters the inference-cost calculation is different β€” only a small number of operators can afford to serve a 405B dense model, and those operators have different cost structures [src_030].
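The trade can be sketched with the same parametric loss from earlier: hold predicted loss fixed, shrink \(N\), grow \(D\), and compare lifetime compute at an assumed serving volume. Again, Hoffmann et al.'s fitted constants serve only as an illustration:

```python
# Inference-aware trade at a fixed predicted loss, via the Chinchilla
# parametric form. Constants: Hoffmann et al.'s fit (illustration only).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_for_loss(n: float, target: float) -> float:
    gap = target - E - A / n**alpha       # loss the data term must close
    return (B / gap) ** (1 / beta) if gap > 0 else float("inf")

n_ref, d_ref = 70e9, 1.4e12               # Chinchilla-like reference point
target = E + A / n_ref**alpha + B / d_ref**beta
served = 20e12                            # assumed lifetime tokens served

for n in (70e9, 35e9, 20e9):
    d = tokens_for_loss(n, target)        # tokens needed to match the loss
    total = 6 * n * d + 2 * n * served    # training + lifetime inference
    print(f"N = {n/1e9:.0f}B: D = {d/1e12:.1f}T tokens, lifetime {total:.2e} FLOPs")
# Shrinking N below the compute-optimum raises the training-token bill but
# cuts total lifetime FLOPs once serving volume is large.
```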

πŸ’‘ Key result

Once expected inference volume dominates total lifetime cost, the optimal tokens-per-parameter ratio rises by orders of magnitude above the Chinchilla compute-optimum.

This is not a refutation of Chinchilla. Chinchilla solves a clean, well-posed optimization problem: minimize pretraining loss at fixed training FLOPs. The post-Chinchilla regime solves a different, also clean optimization problem: minimize total deployed compute cost at fixed quality target. The two problems have different optima, and both are useful framings. The mistake would be to conflate them β€” for example, by claiming that Llama-3 8B's \(D/N \approx 1875\) shows Chinchilla "wrong." It does not. It shows that real engineering decisions optimize a different objective than the Chinchilla paper considered.

⚠️ Pitfall

Llama-3's high tokens-per-parameter ratio is not a refutation of Chinchilla. The two are answering different optimisation problems: Chinchilla minimises training loss at fixed training compute; Llama-3 minimises lifetime cost at fixed deployed-model size.

Llama-3's own scaling-law experiments fit the compute-optimal token count as a power law in compute, \(D^*(C) = A \, C^{\alpha}\) with \(\alpha \approx 0.53\) and a small prefactor, and the team noted that iso-FLOP curves flatten near their minimum at large compute, which makes the precise \(N/D\) split robust to small misjudgements [src_030]. That flatness is itself a useful operational fact: at frontier scale, the cost of mild over-training is small.

Where scaling laws break

The scaling laws as described above are claims about pretraining cross-entropy loss. They hold over many orders of magnitude in \(N\), \(D\), and \(C\) when none of the three is bottlenecking the others. They break, or at least become unreliable, in several specific situations.

First, when the small-scale fitting set has structural differences from the target regime, the extrapolation fails. The Kaplan fixed-LR-schedule failure mode is the canonical example: the fitting set differed methodologically from what optimal large-scale training looks like, and the resulting curve was off [src_028, src_004].

Second, when the loss being predicted is a downstream benchmark rather than pretraining cross-entropy, simple power laws are not guaranteed. Llama-3 explicitly addresses this by adding a second stage to its scaling pipeline: first fit pretraining loss versus FLOPs on small models, then map loss to benchmark accuracy using Llama-2 anchor points [src_030]. The two-stage construction is a generalizable recipe and an honest acknowledgement that capability metrics do not in general inherit the power-law smoothness of cross-entropy.
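The shape of that recipe can be sketched in a few lines β€” every constant and anchor point below is a hypothetical placeholder, not a number from the Llama-3 paper:

```python
# Two-stage forecast sketch. Stage 1: a power law maps compute to predicted
# loss. Stage 2: a sigmoid fitted on anchor models maps loss to benchmark
# accuracy. All constants and anchors are hypothetical placeholders.
import numpy as np
from scipy.optimize import curve_fit

def stage1_loss(C: float) -> float:
    return 1.69 + 2.8 * C**-0.05         # hypothetical loss-vs-FLOPs fit

def stage2_acc(loss, k, m):
    # accuracy between chance (0.25) and ceiling (1.0) on a 4-way benchmark
    return 0.25 + 0.75 / (1 + np.exp(k * (loss - m)))

# Hypothetical anchor models: (pretraining loss, measured accuracy)
anchor_loss = np.array([2.4, 2.2, 2.0, 1.9])
anchor_acc = np.array([0.35, 0.48, 0.62, 0.68])
(k, m), _ = curve_fit(stage2_acc, anchor_loss, anchor_acc, p0=[5.0, 2.0])

C_flagship = 1e25
print(f"forecast accuracy: {stage2_acc(stage1_loss(C_flagship), k, m):.2f}")
```

The value of writing it this way is that the empirical risk is localized: stage one inherits the power-law smoothness of cross-entropy, while all of the uncertainty about capability lives in the stage-two curve and its anchors.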

Third, when data quality rather than quantity becomes the constraint, the pure \(N\)-\(D\) formulation is incomplete. Chinchilla itself notes that further scaling requires more, higher-quality data [src_028]; Llama-3's data-mix scaling-law machinery is the explicit response β€” fit scaling laws to candidate data mixes on small models and iterate on the mix [src_030].

Fourth, and most contested, scaling laws as fit on pretraining loss are silent on emergent capabilities β€” the observation that some downstream skills, like multi-step arithmetic or chain-of-thought reasoning, appear to switch on at particular scales rather than improving smoothly. The empirical literature is split on whether these jumps are real phenomena of the loss-to-capability map or artifacts of metric choice, with later work arguing that smoother metrics often reveal smooth underlying improvement where harsh thresholded metrics had suggested a jump [src_004, src_046]. This chapter does not adjudicate the controversy; it notes only that pretraining scaling laws are a tool for predicting pretraining loss, and the loss-to-capability translation is a separate empirical question on which informed people disagree.

What scaling laws still tell us

In 2026, scaling laws remain the discipline that turns frontier pretraining from a leap of faith into a forecast. They predict pretraining loss at the target scale reliably, given a careful fitting set. They tell a team where a proposed architectural change will and will not move the curve β€” most width-and-depth shape changes do not, while changes that affect the effective information per parameter (better tokenization, mixture-of-experts, better data) do [src_006, src_004]. They are the basis on which a flagship configuration is chosen.

What they do not do is predict capabilities. A team that fits Chinchilla-style scaling laws gets a confident prediction of cross-entropy at 70B; what they do not get is a confident prediction of MMLU score, of code-generation accuracy, or of whether the model will spontaneously do chain-of-thought on novel problems. The two-stage construction Llama-3 uses β€” predict loss, then map loss to capability via anchor models β€” is the current best practice for forecasting capability, but the second stage carries empirical risk the first does not.

Two orthogonal scaling axes deserve flagging here as bridges to later chapters. First, mixture-of-experts (Chapter 10) breaks the assumption that all parameters are active per token, which changes the relationship between \(N\), FLOPs, and capability β€” total parameters and active parameters become separate quantities, and the right scaling-law form for MoE looks different from the dense form considered here. Second, test-time compute (Chapter 13) operates at a different point in the model lifecycle entirely: instead of allocating more compute to pretraining, allocate it at inference via longer reasoning chains. Both axes complicate the simple "more \(N\) or more \(D\)" framing that this chapter has worked within. Neither invalidates it.

πŸ”— Connection

Two complementary scale axes are picked up in the next chapters: the parameter axis is taken sparse in Chapter 10 (Mixture of Experts), and the inference axis is taken longer in Chapter 13 (Reasoning Models).

The discipline that Kaplan founded and Chinchilla repaired remains the right starting point for any thinking about scale. The mistake, after Chinchilla, is to treat the laws as a recipe rather than a measurement; the right posture is to refit them, on the configuration the team actually intends to ship, before committing to a flagship run.

πŸ’‘ Key result

Scaling laws are a measurement of one configuration choice on one corpus, not a universal recipe β€” refit before each flagship run.

πŸ”„ Recap

  • Compare. What is the difference between predicting the loss of a flagship model from a scaling law and predicting its capability on a downstream benchmark? Which is reliable?
  • Predict. A team is about to commit \(10^{25}\) FLOPs to a flagship run. They have 4Γ— \(10^{22}\)-FLOP runs from a smaller-scale ladder. Should they trust those small-scale fits without further work? What should they do first?
  • Generate. Construct an example (in your own words, ~3 sentences) of a scaling claim that would be a measurement (legitimate) versus one that would be a recipe (over-reach).

References