
Inside a Modern Decoder-Only LLM

Chapters 2 through 4 introduced the components — RoPE, RMSNorm, SwiGLU, GQA, FlashAttention, KV-cache. Chapter 7 walked through the architectural trichotomy and closed by handing the question of why decoder-only won to this chapter. The job here is integrative: take the components, fit them together into a 2024-2025 decoder block, and step back to see what the resulting class of models has in common across vendors. The centerpiece is a comparison table that puts Llama, DeepSeek-V3, GPT-3, and ModernBERT in the same coordinate system, so the reader can read off both the convergence (every modern decoder uses pre-RMSNorm + RoPE + SwiGLU + a grouped-attention variant) and the deltas that distinguish frontier MoE from the dense Llama line.

The chapter then turns to tokenization, the inference pipeline (prefill and decode), speculative decoding, and the post-Chinchilla over-training regime that gives us 8B-parameter models trained on 15 trillion tokens. DeepSeek-V3 gets its own short section because its Multi-head Latent Attention and auxiliary-loss-free MoE balancing are the two architectural deltas the rest of the field has been converging toward through 2025.

From GPT-3 to today

The decoder-only Transformer arrived as a serious frontier candidate with GPT-3 in 2020. Brown and colleagues showed that a sufficiently large decoder-only model — 175 billion parameters trained on roughly 300 billion tokens with vanilla next-token prediction — exhibited zero-shot and few-shot in-context learning across a broad range of tasks at inference time, with no parameter updates [src_037]. That result reframed what a decoder-only model is for. It is not a clever trick for one task; it is a substrate that, given enough scale and enough data, gains a general capability to follow patterns presented in its prompt.

Three years later, Meta's Llama 2 release demonstrated that the same recipe — decoder-only, autoregressive, scaled — could be reproduced and released as open weights at 7B and 70B parameter scales [src_029]. Llama 2 also fixed a particular set of architectural choices — pre-RMSNorm, RoPE, SwiGLU, and Grouped-Query Attention at the larger scale — that the field then converged on. The Llama 3 herd in 2024 refined the same recipe with a wider training-data mix, a 128k-token tokenizer, and GQA at every model size [src_030]. DeepSeek-V3, released at the end of 2024, kept the same component vocabulary but pushed two deltas — Multi-head Latent Attention for KV-cache compression and an auxiliary-loss-free mixture-of-experts router — that the rest of the frontier has been adopting variants of [src_031]. The architecture, in other words, has converged. The remaining axes of differentiation are tokenizer coverage, training-data composition, the dense-vs-MoE choice, and the specific GQA / MQA / MLA variant of attention. This chapter makes that claim concrete.

🔄 Recap

Before turning to the canonical block in §2, retrieve the Chs.2-4 components this chapter assumes you have met:

  • Complete the convergence recipe. Every modern decoder uses pre-______ normalisation, RoPE on Q/K, ______ (gated) activation in the FFN, and grouped-______ attention.
  • Explain in your own words. Why does in-context learning emerge from scale + next-token prediction alone, with no explicit task supervision?
  • Predict the four differentiation axes. Given the convergence list above, name the four axes on which the remaining design decisions for a 2024-2026 decoder are made.

The canonical 2026 decoder block

Figure 1: Modern decoder-only Transformer block (residual stream view). Pre-RMSNorm and RoPE feed into a Grouped-Query Attention sublayer, followed by pre-RMSNorm into a SwiGLU feed-forward network. Each chapter in Part II elaborates one component.

Figure 1 shows the block that has converged across the Llama / Qwen / Gemma / DeepSeek-V3 family. A residual stream of shape \([B, T, D]\) enters the block. It is split into two paths. The first path normalizes the input with RMSNorm, projects it to queries, keys, and values, applies RoPE to the queries and keys, runs Grouped-Query Attention with a causal mask (or, in DeepSeek-V3, Multi-head Latent Attention), and adds the result back into the residual stream. The second path normalizes again with RMSNorm, runs the SwiGLU feed-forward network, and adds that back into the residual stream as well [src_002, src_047].

Three things deserve emphasis. First, the placement of the normalization. In the pre-norm configuration the RMSNorm is applied before each sublayer's transformation, leaving the residual stream itself unnormalized; the residual addition is between the unnormalized input and the sublayer's output. This is more stable at depth than the post-norm placement of the original Vaswani Transformer, where the LayerNorm sat after the residual add — a configuration that struggled to train at the layer counts modern decoders use [src_002, src_006].
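
To make the wiring concrete, here is a minimal sketch of the two placements — `sublayer` and `norm` are generic stand-ins (attention or FFN, RMSNorm or LayerNorm), not code from any particular implementation:

```python
def post_norm_block(x, sublayer, norm):
    # Original Vaswani placement: normalize AFTER the residual add,
    # so the residual stream itself is re-normalized at every step.
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # Modern placement: normalize only the sublayer's input; the residual
    # stream itself is never normalized, which trains more stably at depth.
    return x + sublayer(norm(x))
```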

🤔 Pause and reflect

Before reading on: pre-norm normalises the input to each sublayer and leaves the residual stream unnormalised; post-norm normalises after the residual add, so the residual stream itself is normalised at every step. Predict — which configuration's residual stream norms grow with depth, and which stays roughly constant? (Do not look ahead — write the answer down or say it out loud.)

Chapter 3 develops this story in detail.

Second, RoPE is applied only to the query and key projections, not to values. The mechanism is a position-dependent rotation in pairs of embedding dimensions, and the resulting attention scores depend on the relative position of two tokens through that rotation.

🎯 Intuition

Imagine each pair of embedding dimensions as a clock hand whose angle increases linearly with token index. Two tokens at positions \(m\) and \(n\) have their clock-hand pairs rotated by \(m\theta\) and \(n\theta\); the inner product between the rotated pairs depends only on the angle difference \((m-n)\theta\), so attention scores see relative position automatically — no separate position embedding is added to the values stream.

The RoPE base frequency \(\theta\) is a hyperparameter; Llama-3 uses \(\theta = 500{,}000\), which extends the band of frequencies that the rotation can resolve and helps the model use long contexts [src_030]. RoPE itself is the topic of Chapter 2.
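
A quick numerical check of that relative-position property — a sketch assuming the usual pairwise rotation with a geometric frequency schedule; the helper name and toy vectors are ours:

```python
import numpy as np

def rotate_pairs(x, pos, theta_base=500_000.0):
    """Apply a RoPE-style rotation to a vector of even length d.

    Dimension pair i is rotated by pos * theta_i, with theta_i following the
    usual geometric schedule theta_base ** (-2i/d). Illustration only, not a
    production kernel.
    """
    d = x.shape[-1]
    i = np.arange(d // 2)
    freqs = theta_base ** (-2 * i / d)          # per-pair angular frequency
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# The rotated inner product depends only on the offset m - n:
s1 = rotate_pairs(q, 10) @ rotate_pairs(k, 7)      # positions 10 and 7, offset 3
s2 = rotate_pairs(q, 110) @ rotate_pairs(k, 107)   # same offset, shifted by 100
print(np.isclose(s1, s2))  # True
```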

Third, GQA replaces the textbook Multi-Head Attention. In MHA each query head has its own key and value heads; in MQA all queries share a single key/value head pair; in GQA the queries are split into groups, and each group shares one key/value head pair [src_006, src_030]. The trade-off is set by the ratio \(h / h_{kv}\), where \(h\) is the number of query heads and \(h_{kv}\) is the number of key/value heads. Llama-3 70B uses \(h = 64\) and \(h_{kv} = 8\), an 8-to-1 sharing ratio [src_030].

🎯 Intuition

Picture 64 query heads as 64 readers all consulting the same library. With MHA, every reader has a private notebook (64 notebooks total); with MQA, all 64 share one notebook (cheap to store but every read is contended); with GQA at \(h_{kv} = 8\), the readers split into 8 groups of 8, each group sharing one notebook. That is an 8× reduction in notebook storage with negligible read contention — the engineering sweet spot between MHA's redundancy and MQA's bottleneck.

The KV-cache memory of a decode step scales linearly in \(h_{kv}\), so this ratio sets the memory footprint of inference. We return to that calculation in §5.

The block is otherwise unsurprising: the FFN expansion runs through SwiGLU, residual addition closes each sublayer, and the output is another tensor of shape \([B, T, D]\) that feeds the next layer.
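
Assembled, the block can be sketched in a few dozen lines of PyTorch. This is an illustration rather than any vendor's implementation: the module and helper names are ours, `rope_fn` stands in for the rotation sketched earlier (an identity lambda suffices for a shape check), and FlashAttention kernels, KV-caching, and initialization details are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        # Scale by the root-mean-square of the activations; no mean subtraction.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class DecoderBlock(nn.Module):
    def __init__(self, d=4096, n_heads=32, n_kv_heads=8, ffn_mult=3.5):
        super().__init__()
        self.h, self.h_kv = n_heads, n_kv_heads
        self.d_h = d // n_heads
        self.attn_norm = RMSNorm(d)
        self.ffn_norm = RMSNorm(d)
        self.wq = nn.Linear(d, n_heads * self.d_h, bias=False)
        self.wk = nn.Linear(d, n_kv_heads * self.d_h, bias=False)
        self.wv = nn.Linear(d, n_kv_heads * self.d_h, bias=False)
        self.wo = nn.Linear(n_heads * self.d_h, d, bias=False)
        d_ff = int(ffn_mult * d)
        self.w_gate = nn.Linear(d, d_ff, bias=False)   # SwiGLU gate branch
        self.w_up = nn.Linear(d, d_ff, bias=False)     # SwiGLU linear branch
        self.w_down = nn.Linear(d_ff, d, bias=False)

    def forward(self, x, rope_fn):
        B, T, D = x.shape
        # --- attention sublayer: pre-norm, GQA, residual add ---
        h = self.attn_norm(x)
        q = self.wq(h).view(B, T, self.h, self.d_h).transpose(1, 2)
        k = self.wk(h).view(B, T, self.h_kv, self.d_h).transpose(1, 2)
        v = self.wv(h).view(B, T, self.h_kv, self.d_h).transpose(1, 2)
        q, k = rope_fn(q), rope_fn(k)          # RoPE on Q and K only, not V
        g = self.h // self.h_kv                # GQA: replicate each KV head across its group
        k = k.repeat_interleave(g, dim=1)
        v = v.repeat_interleave(g, dim=1)
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.wo(a.transpose(1, 2).reshape(B, T, D))
        # --- feed-forward sublayer: pre-norm, SwiGLU, residual add ---
        h = self.ffn_norm(x)
        x = x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))
        return x
```

The default dimensions mirror the Llama-3 8B column of the §3 table (\(D = 4096\), 32 query heads, 8 KV heads, FFN width 14,336). For a pure shape check, `DecoderBlock()(torch.randn(2, 16, 4096), rope_fn=lambda t: t)` returns a `[2, 16, 4096]` tensor.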

💡 Key result

The 2024-2026 decoder block is pre-RMSNorm + RoPE on Q/K + Grouped-Query Attention + SwiGLU FFN, with residual addition closing each sublayer — the recipe the §3 comparison table is about to demonstrate as convergent across vendors.

Architecture comparison table

The table below puts five models in the same coordinate system. Three are dense decoders (Llama-3 8B, Llama-3 70B, GPT-3 175B); one is a sparse MoE decoder (DeepSeek-V3); and one is an encoder included for contrast (ModernBERT). Llama-2 is implicit: the Llama-3 columns are direct refinements of the Llama-2 recipe, not departures from it.

| Dimension | Llama-3 8B | Llama-3 70B | DeepSeek-V3 | ModernBERT (encoder) | GPT-3 175B (historical) |
|---|---|---|---|---|---|
| Family | dense decoder | dense decoder | sparse MoE decoder | dense encoder | dense decoder |
| Normalization | Pre-RMSNorm | Pre-RMSNorm | Pre-RMSNorm | Pre-LN with bias-free LayerNorm (no learnable bias parameter) | Pre-LN |
| Positional encoding | RoPE, \(\theta = 500{,}000\) | RoPE, \(\theta = 500{,}000\) | RoPE | RoPE (alternating \(\theta = 160{,}000\) global / \(\theta = 10{,}000\) local) | learned absolute |
| FFN activation | SwiGLU | SwiGLU | SwiGLU | GeGLU | GELU |
| Attention variant | GQA, \(h = 32\), \(h_{kv} = 8\) | GQA, \(h = 64\), \(h_{kv} = 8\) | MLA (latent KV-cache) + MoE FFN | full bidirectional, alternating global / local sliding window (128) | MHA |
| Vocabulary \(V\) | \(128{,}000\) (tiktoken-style BPE) | \(128{,}000\) | \(128{,}000\) | \(50{,}368\) (modified OLMo BPE) | \(\approx 50{,}257\) |
| Hidden \(D\) | \(4{,}096\) | \(8{,}192\) | \(7{,}168\) | (encoder; varies by checkpoint) | \(12{,}288\) |
| Layers \(L\) | \(32\) | \(80\) | \(61\) | (encoder) | \(96\) |
| Native context | \(8{,}192\) tokens (extended later) | \(8{,}192\) tokens (extended later) | \(128{,}000\) tokens | \(8{,}192\) tokens | \(2{,}048\) tokens |
| Total params | \(\approx 8 \times 10^9\) | \(\approx 70 \times 10^9\) | \(671 \times 10^9\) | (encoder) | \(175 \times 10^9\) |
| Active params per token | dense, \(\approx 8 \times 10^9\) | dense, \(\approx 70 \times 10^9\) | \(37 \times 10^9\) | dense | dense, \(175 \times 10^9\) |
| Pretraining tokens | \(\approx 15 \times 10^{12}\) | \(\approx 15 \times 10^{12}\) | \(14.8 \times 10^{12}\) | \(\approx 2 \times 10^{12}\) | \(\approx 0.3 \times 10^{12}\) |
| Pretraining compute | (large; see paper) | (large; see paper) | \(\approx 2.788 \times 10^6\) H800 GPU-hours | (encoder; smaller scale) | (large; see paper) |

🔗 Connection

ModernBERT's alternating global / local sliding-window attention is unpacked in Chapter 7 — Encoder, Decoder, and Encoder-Decoder. The dual-RoPE-base trick (160K global / 10K local) is the long-context analogue of the alternating-attention layer stack.

🤔 Pause and reflect

Before reading the observations below, scan the table top-to-bottom and predict: which row has the most convergence across decoder columns? Which row has the most divergence? Which DeepSeek-V3 cell is the most likely to surprise you given the four prior columns? (Do not look ahead — write the answers down before reading on.)

References for the columns: Llama-3 [src_030]; DeepSeek-V3 [src_031]; ModernBERT [src_016]; GPT-3 [src_037]; Llama-2 baseline behind the Llama-3 columns [src_029]. The narrative gloss that frames the table — that every successful model is a known foundation modified for specific needs — comes from the Smol Playbook [src_006]. Qwen and Gemma columns are not in the table because the table fixes a late-2024 snapshot; the primary technical reports for the Qwen3 family [src_057] and Gemma 3 [src_058] arrived in mid-2025 and March 2025 respectively, and we treat them in the "Where the open-weights frontier sits in April 2026" subsection below rather than retrofitting them into the historical comparison.

A few observations worth pulling out of the table.

The convergence is real. Outside the historical GPT-3 column, every decoder uses pre-RMSNorm, RoPE, SwiGLU, and a grouped attention variant. The encoder column (ModernBERT) shares all of these except the activation choice (GeGLU vs SwiGLU) and the directionality of attention (full bidirectional with the alternating global/local pattern described in Chapter 7) [src_016].

GQA's sharing ratio is conservative across the dense Llama family. Both the 8B and 70B Llama-3 checkpoints use \(h_{kv} = 8\), even though the 70B model has eight times as many query heads [src_030]. This is the design that makes KV-cache memory scale gently with model size; it is also what makes the 70B model viable on a single 8-GPU host at modest batch sizes.

The vocabulary jumped from GPT-3's 50k-class encoding to a 128k-class encoding for the Llama-3 line and DeepSeek-V3 [src_030, src_031]. The larger vocabulary improves compression on multilingual and code data and reduces sequence lengths at fixed text content. We discuss this in §4.

The token-to-parameter ratio in the bottom rows is the over-training story. Llama-3 8B trained on 15T tokens has a tokens-per-parameter ratio of roughly \(1{,}875\), far above the Chinchilla-optimal \(\approx 20\) that Chapter 9 derives. We unpack this in §7.

DeepSeek-V3 is the first column that breaks the dense pattern. It has \(671 \times 10^9\) total parameters but routes each token to only \(37 \times 10^9\) active parameters per forward pass through a mixture-of-experts FFN [src_031]. The total/active distinction is the defining feature of MoE inference economics, and Chapter 10 develops the routing and load-balancing machinery in detail. The MLA attention in the same column is a separate axis of innovation and we return to it in §8 below.

🔗 Connection

DeepSeek-V3's mixture-of-experts routing and auxiliary-loss-free load balancing are developed in Chapter 10; the Multi-head Latent Attention axis is unpacked in §8 below.

💡 Key result

Outside the historical GPT-3 column, every modern decoder uses the same four-component recipe — pre-RMSNorm, RoPE, SwiGLU, grouped attention — and differentiates on the attention variant, dense-vs-MoE choice, vocabulary size, and training-data mix; the frontier subsection below records how those differentiating axes shifted through April 2026.

Where the open-weights frontier sits in April 2026

The comparison table above captures a late-2024 snapshot. By April 2026 the open-weights frontier has moved through two more release waves. The convergent core — pre-RMSNorm, RoPE-style positional encoding, SwiGLU, and grouped-query attention — survived. Every model that landed between mid-2025 and April 2026 keeps that recipe. What changed is everything around it.

First, MoE became the default at the top end. Llama 4 Scout and Maverick (April 2025) were Meta's first MoE releases, with 17 billion active parameters across 16 and 128 experts respectively [src_059]. DeepSeek-V4 (April 2026) ships at 1.6 trillion total parameters with 49 billion active per token [src_060]. Qwen3-235B-A22B (May 2025) activates 22 billion of 235 billion total parameters across 128 experts with 8 active [src_057]. The dense-versus-sparse split that the table treats as a Llama-vs-DeepSeek dichotomy is now a sparsity-ratio axis, with the most aggressive ratios near 5 percent active.

Second, attention pluralized. DeepSeek went MLA → DeepSeek Sparse Attention → a Compressed and Heavily-Compressed Attention hybrid [src_060]. Gemma 3 doubled down on a 5:1 sliding-local-to-global ratio with a 1024-token local window [src_058]. Llama 4 introduced "iRoPE": some layers carry no positional embedding at all, paired with an inference-time attention temperature scaling for length generalisation [src_059].

🎯 Intuition

iRoPE = interleaved RoPE: in some layers RoPE is applied normally, in others no positional encoding is applied at all. The no-positional-info layers can attend to any token regardless of distance, which improves length generalisation; the RoPE layers carry the positional signal. The temperature scaling at inference compensates for the longer-range attention that the no-positional layers admit when context grows beyond training length.

⚠️ Pitfall

Six new model identifiers and four new attention variants in three paragraphs is preview density, not learn-by-heart density. The takeaway is the layer-mixing axis (which blocks alternate, how positions are encoded across them, how aggressively the FFN is sparsified) — not memorising which variant lives in which model.

Third, RoPE base frequencies climbed. Llama-3 used 500K; Qwen3 and Gemma 3 use 1 million [src_057, src_058]. Native context windows pushed from 128K (the late-2024 norm) to 256K-1M, with Llama 4 Scout claiming 10M [src_059].

The bottom line for the canonical block: the block-level picture you just learned still describes what is in production. The deltas are at the layer-mixing level — which blocks alternate, how positions are encoded across them, and how aggressively the FFN is sparsified. The table above remains a faithful late-2024 snapshot; this subsection records where the open-weights frontier sits as of April 2026.

Tokenization

Modern decoder LLMs use byte-level Byte Pair Encoding tokenizers — the "tiktoken style" inherited from the GPT line. (tiktoken is OpenAI's open-source BPE tokenizer library, originally shipped with GPT-3.5 and GPT-4; the 100k base vocabulary that Llama-3 extends is the cl100k_base encoding.) The byte-level base means the tokenizer can encode any input string, including code, math symbols, emoji, and arbitrary multilingual text, without ever falling back to an unknown-token marker; the BPE merges then compose those bytes into more efficient sub-word tokens for common English and the tokenizer's training-data languages [src_030].

Llama-3 uses a 128k-vocabulary tokenizer that combines the 100k tokens from the OpenAI tiktoken vocabulary with 28k additional tokens chosen to improve non-English coverage. The compression rate — characters per token, averaged over a representative corpus — improves from 3.17 in Llama-2's 32k tokenizer to 3.94 in Llama-3's 128k tokenizer [src_030]. That number matters because it determines, at fixed text content, how many tokens a sequence costs. A higher compression rate translates directly into more text fitting in the same context window, lower inference cost per character, and faster training throughput per training token.
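
The compression rate is easy to measure yourself. The sketch below uses OpenAI's tiktoken package and its cl100k_base encoding (the 100k-token vocabulary Llama-3 extends); the sample string is arbitrary, so the number it prints will not match the corpus-averaged figures above.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sample = (
    "def rms_norm(x, eps=1e-6):\n"
    "    return x / (x.pow(2).mean(-1, keepdim=True) + eps).sqrt()\n"
    "Οι γάτες κοιμούνται στον ήλιο."   # non-English text stresses tokenizer coverage
)
tokens = enc.encode(sample)
print(f"{len(sample)} chars -> {len(tokens)} tokens "
      f"({len(sample) / len(tokens):.2f} chars/token)")
```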

DeepSeek-V3 also uses a 128k-class BPE tokenizer with byte-level fallback [src_031]. ModernBERT, on the encoder side, uses a 50,368-token modified OLMo BPE — the vocabulary size is deliberately a multiple of \(64\) so that the embedding matrix tiles cleanly on GPU hardware [src_016]. This is a recurring micro-optimization across the modern component stack: vocabulary sizes, head counts, and hidden dimensions are all chosen as multiples of \(64\) or \(128\) where possible.

The historical alternatives are worth naming briefly. BERT and the encoder-only line used WordPiece, a related sub-word algorithm with slightly different merge rules; T5 and the encoder-decoder line used SentencePiece, which folds whitespace into the tokenizer alphabet and operates on raw text rather than pre-tokenized words. The mechanistic contrast is sharp: WordPiece runs over already-whitespace-tokenized words and merges sub-word pieces drawn from a learned vocabulary; SentencePiece operates on raw text streams (whitespace included as a regular character) and is language-agnostic at the input boundary; tiktoken-style byte-level BPE goes one level lower still, treating the input as a stream of bytes so any byte sequence is encodable. Both have been displaced on the decoder-only side by tiktoken-style byte-level BPE because byte-level handling avoids the language-coverage failure modes that pre-tokenized algorithms produce on code and on languages without clean whitespace tokenization.

Inference pipeline: prefill and decode

A modern decoder-only model at inference splits its work into two phases that have very different performance characteristics. The first phase, prefill, processes the entire prompt in a single parallel forward pass. Every token in the prompt attends to every earlier token under the causal mask, all in one matrix-multiply-heavy pass through the stack. The byproduct of prefill is the KV-cache: for every layer, for every prompt token, the projected keys and values are written to a per-layer cache so they do not need to be recomputed [src_002, src_006].

The second phase, decode, generates one new token at a time. Each decode step reads the current token's embedding, runs it through the stack, and at the attention sublayer of each layer projects the new token to a query, key, and value. The new key and value are appended to the layer's KV-cache. The query attends to all keys and values in the cache up to and including the current step. After the FFN and unembedding the model emits a logit distribution over the vocabulary, samples the next token, and the loop repeats.

Decode is fundamentally different work from prefill. Each decode step is a single token moving through the stack, but it must read the KV-cache for every prior token at every layer. The arithmetic intensity — the ratio of floating-point operations to memory traffic — is low. Decode is memory-bandwidth-bound on modern accelerators: the KV-cache reads dominate, and the FLOPs are largely idle [src_006]. This is the fact that motivates almost every inference optimization in the modern stack — KV-cache compression, GQA / MQA / MLA, paged-attention allocators, speculative decoding. (Paged-attention is the vLLM-introduced KV-cache allocator that breaks the cache into fixed-size pages and uses indirection tables, enabling efficient memory sharing across concurrent requests.) Chapter 4 develops the bandwidth analysis in detail.
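
The two phases can be caricatured in a single-head, single-layer toy — hypothetical throughout, with random projection matrices and no real model behind it — just to show what gets computed once at prefill and what gets appended and re-read at every decode step.

```python
import torch
import torch.nn.functional as F

d_h = 64
wq = torch.randn(d_h, d_h) / d_h**0.5
wk = torch.randn(d_h, d_h) / d_h**0.5
wv = torch.randn(d_h, d_h) / d_h**0.5

def prefill(prompt_emb):                       # [T, d_h]: whole prompt, one parallel pass
    k_cache, v_cache = prompt_emb @ wk, prompt_emb @ wv
    q = prompt_emb @ wq
    out = F.scaled_dot_product_attention(
        q.unsqueeze(0), k_cache.unsqueeze(0), v_cache.unsqueeze(0),
        is_causal=True).squeeze(0)
    return out, (k_cache, v_cache)

def decode_step(x, cache):                     # x: [d_h], one new token at a time
    k_cache, v_cache = cache
    k_cache = torch.cat([k_cache, (x @ wk).unsqueeze(0)])   # append, never recompute
    v_cache = torch.cat([v_cache, (x @ wv).unsqueeze(0)])
    q = x @ wq
    att = F.softmax(q @ k_cache.T / d_h**0.5, dim=-1)       # reads the whole cache
    return att @ v_cache, (k_cache, v_cache)

prompt = torch.randn(10, d_h)
_, cache = prefill(prompt)
x = torch.randn(d_h)
for _ in range(5):                             # decode loop: low arithmetic intensity
    x, cache = decode_step(x, cache)
print(cache[0].shape)                          # torch.Size([15, 64])
```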

The size of the KV-cache, recapped from Chapter 4, is

\[ \text{KV-cache bytes per token} = 2 \cdot L \cdot h_{kv} \cdot d_h \cdot b \]

where \(L\) is the number of layers, \(h_{kv}\) the number of key/value heads, \(d_h\) the per-head dimension, and \(b\) the bytes per stored element (the leading \(2\) counts both K and V).

🤔 Pause and reflect

Pause before substituting. Given the formula above and Llama-3 70B's \(L = 80\), \(h_{kv} = 8\), \(d_h = 128\), fp16 storage (\(b = 2\)): predict the per-token KV-cache size to within an order of magnitude — is it tens of KB, hundreds of KB, or megabytes? Then check against the next equation.

For Llama-3 70B with \(L = 80\), \(h_{kv} = 8\), \(d_h = 128\), and \(b = 2\) (fp16 storage), this is

\[ 2 \cdot 80 \cdot 8 \cdot 128 \cdot 2 = 327{,}680 \text{ bytes per token}, \]

or roughly \(320\) KiB per token of context. At a context length of \(8{,}192\) that is approximately \(2.5\) GiB of KV-cache per sequence.

🎯 Intuition

An 8B-parameter model in fp16 weighs about \(16\) GiB. At \(128{,}000\) tokens of context, the per-sequence KV-cache for a Llama-3 70B GQA-8 configuration is roughly \(40\) GiB — more than twice the weight footprint of an 8B model. This is the calibrated punchline of the formula above: at long contexts, the cache stops being a footnote and starts being the dominant memory term.

At the longer context lengths the model later supports (up to \(128{,}000\) tokens after RoPE-base extension), the KV-cache exceeds the weights of an 8B model. Configuration numbers from [src_030]; the formula and its motivation were developed in Chapter 4. This linear scaling in \(h_{kv}\) is the architectural reason GQA exists: collapsing \(h\) key-value heads into \(h_{kv} = h/g\) heads gives an immediate \(g\)-fold reduction in cache memory.
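
The same arithmetic in executable form, with the helper name ours and the configuration numbers taken from the §3 table:

```python
def kv_cache_bytes_per_token(L, h_kv, d_h, bytes_per_elem):
    # The leading 2 counts both K and V.
    return 2 * L * h_kv * d_h * bytes_per_elem

# Llama-3 70B, fp16 storage
per_token = kv_cache_bytes_per_token(L=80, h_kv=8, d_h=128, bytes_per_elem=2)
print(per_token / 1024, "KiB per token")                                  # 320.0
for ctx in (8_192, 128_000):
    print(f"context {ctx:>7,}: {per_token * ctx / 2**30:5.1f} GiB per sequence")  # ~2.5, ~39.1
```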

🔄 Recap

Retrieval prompts on §5's prefill/decode split, the arithmetic-intensity asymmetry, and the KV-cache size formula:

  • Complete the formula. KV-cache bytes per token = \(2 \cdot \_\_\_ \cdot \_\_\_ \cdot \_\_\_ \cdot \_\_\_\). Name what each factor counts.
  • Explain why decode is memory-bandwidth-bound. What ratio of what two quantities is "low" at decode time, and why does lowering that ratio motivate KV-cache compression rather than faster matmul kernels?
  • Predict the per-sequence cache. For a hypothetical model with \(L = 32\), \(h_{kv} = 8\), \(d_h = 128\), fp16 storage, at \(32{,}768\) tokens of context — within an order of magnitude, what is the per-sequence KV-cache?

Speculative decoding

Memory-bandwidth-bound decode admits a clever amortization trick. Speculative decoding uses a small draft model to propose a sequence of \(K\) candidate next tokens, and then runs the large target model in a single parallel forward pass over the proposed sequence. That pass yields the target model's distribution at each position; the procedure keeps the longest prefix of the proposed tokens that the target model would also have produced, and discards the rest. The net effect is that, when the draft model's predictions agree with the target's, the target model emits multiple tokens per forward pass of its own [src_006].

The win is throughput. Each rejected token represents wasted draft work and a wasted position in the verification pass, but the verification pass itself is cheap relative to \(K\) separate decode steps because it is arithmetic-intensity-friendly: the target evaluates \(K\) tokens in parallel, and the KV-cache is read once for the whole batch of positions instead of \(K\) times. The net throughput gain is positive when the time to draft \(K\) tokens plus the single verification pass is less than the expected number of accepted tokens multiplied by the target's per-token decode latency. Implementation details and the gpt-fast reference implementation are deferred to Appendix B. DeepSeek-V3 builds a related capability into its base model in the form of multi-token prediction (MTP) during training — predicting the next \(k\) tokens jointly from each position via auxiliary heads on the shared backbone, rather than only the standard next token; §8 returns to MTP in the DeepSeek-V3 highlights [src_031].
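
A schematic of the accept-longest-prefix loop, simplified to greedy (argmax) verification — the production algorithm verifies sampled tokens with a rejection step that preserves the target distribution exactly. `draft_next` and `target_preds_batch` are hypothetical callables standing in for the two models.

```python
def speculative_step(tokens, draft_next, target_preds_batch, K=4):
    """Extend `tokens` by 1 to K+1 tokens using a single target-model pass."""
    # 1) The cheap draft model proposes K tokens autoregressively.
    proposal = list(tokens)
    for _ in range(K):
        proposal.append(draft_next(proposal))

    # 2) One parallel target pass scores every position at once (the
    #    arithmetic-intensity-friendly step). target_preds_batch(seq)[i] is
    #    the target's prediction for the token that should follow position i.
    preds = target_preds_batch(proposal)

    # 3) Keep the longest prefix of drafted tokens the target agrees with,
    #    then append the target's own prediction after the last accepted one.
    n_prompt, accepted = len(tokens), []
    for i in range(K):
        if preds[n_prompt + i - 1] == proposal[n_prompt + i]:
            accepted.append(proposal[n_prompt + i])
        else:
            break
    accepted.append(preds[n_prompt + len(accepted) - 1])
    return tokens + accepted

# Toy check: both "models" predict (last token + 1) % 100, so every draft is
# accepted and each call advances K + 1 = 5 tokens per target pass.
draft = lambda seq: (seq[-1] + 1) % 100
target = lambda seq: [(t + 1) % 100 for t in seq]
print(speculative_step([1, 2, 3], draft, target))   # [1, 2, 3, 4, 5, 6, 7, 8]
```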

The post-Chinchilla over-training story

The Chinchilla scaling laws, which Chapter 9 develops, identify a compute-optimal allocation of data and parameters: roughly \(20\) training tokens per parameter. By that yardstick, Llama-3 8B trained on \(15 \times 10^{12}\) tokens has a tokens-per-parameter ratio close to \(1{,}875\) — almost two orders of magnitude past the compute-optimal point [src_030]. Llama-3 70B at the same training-data scale is also clearly past Chinchilla-optimal, though by a smaller factor.

This is not a mistake. Chinchilla optimizes total training compute under a fixed budget. A frontier lab that has to serve a model to users for months or years pays inference compute too, and inference compute is paid per token of generated output, every output, forever. A smaller model trained on more tokens than Chinchilla recommends has worse training-compute efficiency but lower inference cost, and the trade tilts hard toward inference once the model is deployed at scale. The Llama-3 herd report makes this argument explicitly when it justifies the over-training regime [src_030]. Chapter 9 develops the full scaling-laws picture and quantifies the trade-off.
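
A back-of-envelope version of that argument, using the standard \(\approx 6ND\) training-FLOPs and \(\approx 2N\) per-generated-token inference-FLOPs approximations (Chapter 9 derives them). The lifetime serving volume is an assumed figure, and the "Chinchilla-style ~77B" comparison point is our own construction, chosen to match the 8B model's training compute at roughly 20 tokens per parameter.

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    train = 6 * n_params * train_tokens          # ~6ND training FLOPs
    serve = 2 * n_params * inference_tokens      # ~2N FLOPs per generated token
    return train, serve

serve_volume = 1e15   # assumption: 10^15 tokens generated over the deployment lifetime
configs = [("Llama-3 8B, over-trained on 15T tokens", 8e9, 15e12),
           ("Chinchilla-style ~77B at matched training compute", 77e9, 1.54e12)]
for name, n, d in configs:
    train, serve = lifetime_flops(n, d, serve_volume)
    print(f"{name}: train {train:.2e}, serve {serve:.2e}, total {train + serve:.2e} FLOPs")
# Under these assumptions the smaller, over-trained model costs roughly 9x
# less total compute over its lifetime, despite being far past Chinchilla-optimal.
```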

🔗 Connection

The Chinchilla scaling law and the post-Chinchilla over-training regime are derived in Chapter 9. The DeepSeek-V4 / Llama-4 frontier numbers in the §3 frontier subsection above (sparsity ratios, native context lengths, RoPE bases at 1M) sit on the same over-training side of the line, several years on.

💡 Key result

When inference cost dominates lifetime cost, training a smaller model on more data than Chinchilla recommends — Llama-3 8B on 15T tokens, ratio ≈ 1875 — is a deliberate trade of training-compute efficiency for inference-compute economy.

DeepSeek-V3 highlights

DeepSeek-V3 is the most architecturally distinctive frontier-scale model of late 2024, and it is worth surfacing what makes it different [src_031]. Three deltas matter.

The first is Multi-head Latent Attention (MLA). Instead of caching full per-head keys and values, MLA caches a low-rank latent representation that the attention sublayer expands back into per-head keys and values on the fly. The dimensionality of the latent vector is substantially smaller than \(h_{kv} \cdot d_h\), so the KV-cache memory per token shrinks accordingly. MLA is, in a sense, the next move along the same axis as MHA → MQA → GQA: each step trades a small representational concession for a large reduction in KV-cache memory [src_031]. Chapter 4 develops the attention-variant landscape; Chapter 10 revisits MLA in the context of MoE inference.
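
A schematic contrast of what gets cached per token under GQA versus MLA — the dimensions loosely follow the DeepSeek-V3 column of the §3 table, the module names are ours, and the real MLA additionally carries a small decoupled RoPE key path that this sketch omits.

```python
import torch
import torch.nn as nn

d_model, h, d_h, d_latent = 7168, 128, 128, 512

# GQA-style: cache per-head K and V directly (here a hypothetical h_kv = 8).
h_kv = 8
gqa_cache_per_token = 2 * h_kv * d_h                 # K + V elements cached per token

# MLA-style: cache only a low-rank latent; expand to per-head K/V on the fly.
down = nn.Linear(d_model, d_latent, bias=False)      # compression (its output is cached)
up_k = nn.Linear(d_latent, h * d_h, bias=False)      # expansion at attention time
up_v = nn.Linear(d_latent, h * d_h, bias=False)

x = torch.randn(1, d_model)                          # one token's hidden state
latent = down(x)                                     # this [d_latent] vector is all that is cached
k = up_k(latent).view(1, h, d_h)                     # rebuilt per-head keys
v = up_v(latent).view(1, h, d_h)                     # rebuilt per-head values

print("GQA cache elements/token:", gqa_cache_per_token)   # 2048
print("MLA cache elements/token:", d_latent)               # 512
```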

The second is Mixture-of-Experts with auxiliary-loss-free load balancing. DeepSeek-V3's FFN is replaced with a stack of experts and a router that sends each token to a subset of those experts. The \(671 \times 10^9\) total parameters in the model include all experts; only the \(37 \times 10^9\) active parameters per token participate in any given forward pass [src_031]. Earlier MoE designs added an auxiliary loss term to the training objective to keep experts evenly utilized; DeepSeek-V3 instead uses a bias-correction scheme on the router that achieves balance without an auxiliary loss. The mechanics belong to Chapter 10.
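
The flavour of the bias-correction scheme can be sketched as follows — a simplification in the spirit of the published design, not its exact mechanics (the real router uses sigmoid affinities, per-sequence statistics, and a tuned update speed); the class and parameter names are ours.

```python
import torch

class BiasBalancedRouter:
    def __init__(self, n_experts=16, top_k=2, gamma=1e-3):
        self.top_k, self.gamma = top_k, gamma
        self.bias = torch.zeros(n_experts)   # adjusted online, not trained by gradients

    def route(self, scores):
        """scores: [tokens, n_experts] nonnegative router affinities for one batch."""
        # The bias enters only the top-k *selection*, never the gating weights.
        topk = torch.topk(scores + self.bias, self.top_k, dim=-1).indices
        gate = torch.zeros_like(scores).scatter(1, topk, scores.gather(1, topk))
        gate = gate / gate.sum(-1, keepdim=True)
        # After the step, nudge biases: overloaded experts down, underloaded up,
        # so balance emerges without an auxiliary loss term in the objective.
        load = torch.bincount(topk.flatten(), minlength=scores.shape[1]).float()
        self.bias = self.bias - self.gamma * torch.sign(load - load.mean())
        return topk, gate

router = BiasBalancedRouter()
topk, gate = router.route(torch.rand(4096, 16))   # one batch of token affinities
```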

The third is the training scale itself. DeepSeek-V3 was pretrained on \(14.8 \times 10^{12}\) tokens using approximately \(2.788 \times 10^6\) H800 GPU-hours [src_031]. That number is one of the most concretely documented frontier-scale training budgets in the open literature, and it grounds Chapter 9's discussion of effective training-compute efficiency at MoE scale.

Closing summary

The architecture has converged. A 2024-2025 decoder-only LLM is a stack of pre-RMSNorm Transformer blocks with RoPE on Q/K, a grouped attention variant for KV-cache compression, and a SwiGLU FFN. The remaining axes of differentiation are the attention variant (GQA across the Llama family, MLA in DeepSeek-V3), the dense-vs-MoE choice, the tokenizer's vocabulary and coverage, and the training-data mix. Components that this chapter only named in passing — RoPE base frequencies, FlashAttention kernels, the exact form of pre-RMSNorm — were developed in detail in Chapters 2 through 4 [src_046].

The next part of the book is about scale. Chapter 9 derives the Chinchilla scaling laws and walks through the post-Chinchilla over-training regime that the Llama-3 herd inhabits [src_030]; Chapter 10 develops mixture-of-experts and the auxiliary-loss-free balancing of DeepSeek-V3 [src_031]. After scale comes alignment: Part VI takes the pretrained decoder backbone produced by the recipes here and converts it into an instruction-following, preference-aligned, and reasoning-capable system. The pretrained model that this chapter assembles is the substrate on which all of that further work happens.

🔗 Connection

Scale: Ch.9 scaling laws and Ch.10 mixture-of-experts. Alignment: Part VI of the book covers SFT/RLHF, DPO, and reasoning models. The pretrained decoder this chapter assembles is the substrate for everything that follows.

🔄 Recap

Chapter-level retrieval prompts on the centrepiece content (§2 block + §3 table + §5 KV-cache + §6 speculative decoding + §7 over-training):

  • Complete the convergence list. Which four components are shared by every modern decoder column in the §3 table?
  • Complete the calculation. What is the per-token KV-cache size for Llama-3 70B (in KiB), and why does that number rise sharply at long contexts?
  • Explain why over-training (training a smaller model on far more data than Chinchilla recommends) makes economic sense at deployment scale.
  • Compare the block-level picture you learned in §2 with the layer-mixing-level picture the §3 frontier subsection introduces. Which level is converged across vendors in April 2026, and which level is the active design space?

References