
Modern Normalization and Activations

Chapter 1 fixed the skeleton of a modern decoder block: a residual stream punctuated by attention sublayers and feed-forward sublayers, with normalization placed before each sublayer rather than after. Chapter 2 swapped learned absolute position embeddings for RoPE inside the attention path. This chapter swaps two more components that the original 2017 Transformer specified one way and that every Llama, Qwen, Gemma, and DeepSeek model now specifies differently:

  • the normalization itself, which is no longer LayerNorm but RMSNorm;
  • the feed-forward sublayer, which is no longer a two-matrix ReLU or GELU stack but a three-matrix gated structure called SwiGLU.

Neither change is theoretically deep. Both are empirical wins, both were proposed years before they were widely adopted, and both became standard not because the math forced them but because the open-weights ecosystem coordinated around the same recipe at scale.

1. Why renormalize at all

πŸ”— Connection

Pre-norm placement and the residual stream were introduced in Chapter 1; rotary position embeddings (RoPE), the attention-side swap referenced above, were derived in Chapter 2. Both are prerequisites for Β§9's consolidated block.

We retain the pre-norm convention from Chapter 1: each sublayer reads as \(y = x + \mathrm{Sublayer}(\mathrm{Norm}(x))\), with Norm applied to the residual stream just before attention or before the FFN, and the residual added unnormalized [src_010, src_002]. The job of the normalization is to keep the magnitude of the activations and gradients bounded across depth so that training does not diverge β€” not to reshape the distribution into anything statistically pristine [src_018]. The original "internal covariate shift" framing has been challenged by subsequent work, and the operational reading that survives is simpler: normalization controls activation magnitude, and that control is what stabilizes optimization [src_018].

This chapter is about what goes inside the Norm and inside the FFN. The placement itself, pre-norm, stays fixed.

2. From LayerNorm to RMSNorm

🎯 Intuition

LayerNorm subtracts the mean across the \(D\) feature axis, then divides by the standard deviation, then re-applies a per-feature gain and bias. Geometrically, the subtraction projects the activation vector onto the hyperplane perpendicular to the all-ones direction in feature space β€” it removes whatever component of the vector points along \((1, 1, \dots, 1)\). The next section asks whether the residual stream already lives on that hyperplane, in which case the subtraction is doing zero work.

LayerNorm, as introduced for Transformers, normalizes a token's activations \(a \in \mathbb{R}^D\) across the feature dimension by subtracting the mean and dividing by the standard deviation, then applies a learned per-feature gain \(g\) and bias \(b\):

\[ \mathrm{LayerNorm}(a)_i \;=\; \frac{a_i - \mu}{\sigma} \cdot g_i + b_i, \qquad \mu = \frac{1}{D}\sum_{j=1}^{D} a_j, \qquad \sigma = \sqrt{\frac{1}{D}\sum_{j=1}^{D} (a_j - \mu)^2 + \varepsilon}. \]

πŸ€” Pause and reflect

Before reading on, predict β€” if you delete the mean term (\(a_i - \mu \to a_i\)) from LayerNorm but keep the standard-deviation rescaling, which of the two invariances (re-centering, re-scaling) survives, and why? (Do not look ahead β€” write the answer down or say it out loud.)

Two invariances fall out of this definition. Re-centering invariance: shifting the input or the upstream weights by a constant leaves the LayerNorm output unchanged. Re-scaling invariance: multiplying the input or the upstream weights by a positive constant also leaves the output unchanged [src_018].

Zhang and Sennrich asked which of these two invariances is doing the actual stabilization work. Their hypothesis was that re-centering is dispensable: the residual stream in a Transformer already absorbs additive shifts because every sublayer adds back to it, so subtracting the mean again at every Norm is largely redundant [src_018]. They proposed dropping the mean term and the additive bias and keeping only the second moment:

\[ \mathrm{RMSNorm}(x)_i \;=\; \frac{x_i}{\mathrm{RMS}(x)} \cdot \gamma_i, \qquad \mathrm{RMS}(x) = \sqrt{\frac{1}{D}\sum_{j=1}^{D} x_j^{2} + \varepsilon}, \]

where \(\gamma \in \mathbb{R}^{D}\) is the learned per-feature gain and \(\varepsilon\) is a small numerical stabilizer (typically \(10^{-5}\) or \(10^{-6}\)) [src_018]. There is no learned bias and no mean subtraction. When the input mean happens to be zero, RMSNorm and LayerNorm coincide exactly; when it does not, RMSNorm sacrifices re-centering invariance and keeps only re-scaling invariance [src_018].
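The coincidence claim is easy to verify numerically. The sketch below (toy dimensions and tolerances chosen for illustration, not taken from the sources) forces each token's features to zero mean and checks that an affine-free LayerNorm and an affine-free RMSNorm then agree:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
D = 8
x = torch.randn(4, D)
x = x - x.mean(dim=-1, keepdim=True)  # force zero mean per token

eps = 1e-5
ln = F.layer_norm(x, (D,), eps=eps)                               # LayerNorm, unit gain, zero bias
rms = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)  # RMSNorm, unit gain
print(torch.allclose(ln, rms, atol=1e-6))  # True: on zero-mean input the two coincide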

The savings are visible at the level of the formula itself. LayerNorm needs two passes over the feature dimension (one for \(\mu\), one for \(\sigma\)) plus a subtraction, while RMSNorm needs one pass (the sum of squares) and no subtraction; LayerNorm has \(2D\) learned parameters per layer (\(g\) and \(b\)), while RMSNorm has \(D\) (\(\gamma\) alone) [src_018].

3. The empirical case for dropping the mean

The published case for RMSNorm is empirical, not theoretical. Zhang and Sennrich's Table 1 reports running-time reductions of 7%–64% across machine translation (RNNSearch, Transformer), image classification, image-caption retrieval, and question answering, with downstream quality matching LayerNorm to within run-to-run noise [src_018]. They additionally show that estimating the RMS from a 6.25% subset of the features β€” a partial RMSNorm β€” remains competitive [src_018]. Stanford's CS336 lecture notes treat RMSNorm as the de facto modern choice, citing the same speed-without-quality-loss result as the reason the Llama family adopted it [src_004].
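The partial variant is small enough to sketch inline. In the snippet below (an illustration, not the paper's code: reading the estimate off a contiguous leading slice of the features is one natural implementation, and the function name and defaults are ours), the RMS is estimated from the first 6.25% of features and used to rescale all of them:

import torch

def prmsnorm(x: torch.Tensor, p: float = 0.0625, eps: float = 1e-6) -> torch.Tensor:
    # Estimate the RMS from the first p fraction of features, then rescale the full vector.
    k = max(1, int(x.shape[-1] * p))
    rms = torch.sqrt(x[..., :k].pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms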

Two things are worth being honest about. First, the speedup is wall-clock, not asymptotic: the gain comes from fewer reductions and fewer learned parameters, not from a stronger mathematical guarantee. Second, the quality claim is parity, not improvement: nothing in the original RMSNorm paper argues that throwing away the mean improves modeling, only that you lose nothing by doing so [src_018]. That is the framing modern practitioners inherit.

πŸ’‘ Key result

RMSNorm matches LayerNorm on quality and runs faster β€” but the speedup is wall-clock from fewer reductions, not a stronger mathematical guarantee.

πŸ”„ Recap

  • Write down the RMSNorm formula given LayerNorm's: which terms disappear, and what is the parameter count per layer compared with LayerNorm's \(2D\)?
  • Of the two LayerNorm invariances (re-centering, re-scaling), which one survives in RMSNorm, and why does the residual stream make that survivor sufficient?
  • Predict: if you also dropped the learned gain \(\gamma\) from RMSNorm, what would the operator reduce to, and what stabilization role would be lost?

4. From ReLU and GELU to gated activations

The second swap concerns the position-wise feed-forward network (the FFN is applied independently to each token position; no cross-position interaction). The original 2017 Transformer specified an FFN of the form (written here without the original's bias terms, matching the bias-free convention used for the rest of this chapter)

\[ \mathrm{FFN}_{\text{ReLU}}(x) \;=\; \mathrm{ReLU}(x W_{1}) \, W_{2}, \]

with the convention \(W_1 \in \mathbb{R}^{D \times d_{ff}}\), \(W_2 \in \mathbb{R}^{d_{ff} \times D}\), and \(d_{ff} = 4D\) for the base model [src_019]. Subsequent work replaced ReLU with smoother non-monotonic alternatives: GELU, defined by \(\mathrm{GELU}(x) = x \cdot \Phi(x)\) where \(\Phi\) is the standard normal CDF, and Swish (also called SiLU), defined by \(\mathrm{Swish}_{\beta}(x) = x \cdot \sigma(\beta x)\) with \(\sigma\) the logistic sigmoid [src_019]. Throughout this chapter and throughout most modern papers, \(\mathrm{Swish}\) means \(\mathrm{Swish}_{1}\), the same function as \(\mathrm{SiLU}\). Both Swish and GELU are pointwise drop-in replacements for ReLU, and both leave the two-matrix structure of the FFN intact.

⚠️ Pitfall

Three names refer to the same function: \(\mathrm{Swish}_{1}\), \(\mathrm{SiLU}\), and PyTorch's \(\mathrm{F.silu}\). The "Swi" in SwiGLU is exactly this \(\beta = 1\) Swish, not a different gate. All three names appear across papers, codebases, and the Β§8 listing, and failing to register the equivalence is a common source of confusion.
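A two-line numerical check of the naming equivalence (a minimal sketch; the grid of test points is arbitrary):

import torch
import torch.nn.functional as F

x = torch.linspace(-4.0, 4.0, steps=9)
swish_1 = x * torch.sigmoid(x)                           # Swish_beta with beta = 1
gelu = x * torch.distributions.Normal(0.0, 1.0).cdf(x)   # x * Phi(x), the exact GELU
print(torch.allclose(F.silu(x), swish_1))                # True: SiLU is Swish_1
print(torch.allclose(F.gelu(x), gelu, atol=1e-6))        # True (F.gelu defaults to the exact erf form)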

4.1 Gating: from one path to two

🎯 Intuition

A pointwise activation (ReLU, GELU, Swish) is one path: project, then nonlinearity, then project back. A gated activation is two paths multiplied: project once into a value, project again into a gate, pass the gate through a scalar nonlinearity, then multiply the two elementwise. The gate is data-dependent β€” it lets each feature decide, per token, how much of the value to keep. Pointwise activations cannot do this; their decision is fixed by the activation function.

Shazeer's 2020 contribution was orthogonal: instead of replacing ReLU with a different pointwise nonlinearity, replace the first linear-then-nonlinearity stage with a gated structure in the spirit of Dauphin et al.'s Gated Linear Units. The general GLU layer is

\[ \mathrm{GLU}(x, W, V, b, c) \;=\; \sigma(x W + b) \,\odot\, (x V + c), \]

i.e., the elementwise product of two linear projections of \(x\), where one projection has been passed through a sigmoid [src_019].

πŸ€” Pause and reflect

Given the general GLU form \(\sigma(x W + b) \odot (x V + c)\), predict what changes when the gate function \(\sigma\) is replaced by ReLU, GELU, or Swish. Which replacement preserves the gating intuition (data-dependent multiplicative weight) while removing the saturating sigmoid? (Do not look ahead.)

The variants come from substituting the sigmoid gate with a different scalar nonlinearity:

\[ \begin{aligned} \mathrm{ReGLU}(x, W, V, b, c) &= \mathrm{ReLU}(x W + b) \odot (x V + c), \\ \mathrm{GeGLU}(x, W, V, b, c) &= \mathrm{GELU}(x W + b) \odot (x V + c), \\ \mathrm{SwiGLU}(x, W, V, b, c, \beta) &= \mathrm{Swish}_{\beta}(x W + b) \odot (x V + c). \end{aligned} \]

Plugging these into the FFN, with the conventional bias-free T5 setup, gives the family of feed-forward variants Shazeer tested [src_019]:

\[ \mathrm{FFN}_{\text{SwiGLU}}(x, W, V, W_{2}) \;=\; \big(\mathrm{Swish}_{1}(x W) \odot x V\big) \, W_{2}. \]

There are three weight matrices now, not two: the gate projection \(W\), the value projection \(V\), and the output projection \(W_{2}\).
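A functional sketch at toy dimensions makes the two-matrix versus three-matrix contrast concrete (names and sizes here are illustrative; the module version used by real configurations appears in Β§8):

import torch
import torch.nn.functional as F

D, d_ff = 16, 64
x = torch.randn(2, 5, D)                              # (batch, tokens, features)

W1, W2 = torch.randn(D, d_ff), torch.randn(d_ff, D)                              # two-matrix FFN
W, V, W_out = torch.randn(D, d_ff), torch.randn(D, d_ff), torch.randn(d_ff, D)   # gate, value, output

ffn_relu = F.relu(x @ W1) @ W2                        # one path: project, clip, project back
ffn_swiglu = (F.silu(x @ W) * (x @ V)) @ W_out        # two paths multiplied elementwise, then project back
print(ffn_relu.shape, ffn_swiglu.shape)               # both torch.Size([2, 5, 16])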

5. What the Shazeer table actually shows

Shazeer trained a T5-base model with each FFN variant on the C4 span-filling objective for 524k steps, then fine-tuned on GLUE, SuperGLUE, and SQuAD. Table 1 of the paper reports heldout-set log-perplexity at the end of pre-training; in that table, ReLU and GELU baselines come in at 1.677 and 1.679 respectively, while GeGLU and SwiGLU come in at 1.633 and 1.636 β€” separated from each other by 0.003, a gap well below the run-to-run inter-seed standard deviations the same paper reports for shorter runs [src_019].

🎯 Intuition

Variant    Heldout log-perplexity
ReLU       1.677
GELU       1.679
GeGLU      1.633
SwiGLU     1.636

The gap between the GLU pair and the pointwise pair (~0.04) is real; the gap inside the GLU pair (0.003) is within run-to-run noise.

The downstream story is similar: GeGLU, SwiGLU, ReGLU, and Bilinear all beat ReLU and GELU on most GLUE and SuperGLUE tasks, but the ordering between the GLU-family entries themselves shifts from task to task and is largely within noise [src_019].

Shazeer is explicit about the absence of a theoretical reason. He notes that the GLU-family layers are simple to implement and have no apparent computational drawback, offers no further explanation, and attributes the wins to "divine benevolence" [src_019]. The point of citing this two-word phrase is not the joke but the methodological honesty: the paper presents an empirical ranking and refuses to dress it up as a derivation.

⚠️ Pitfall

A common mistake is to read the GLU-family results as if a theoretical basis existed somewhere behind them. The absence of such a basis is itself the finding, not a gap to be filled: Shazeer presents an empirical ranking, no separate analytical argument exists, and any explanation a reader supplies post hoc is exactly the kind of dressing-up the original paper refused.

6. Why SwiGLU dominates in practice

If GeGLU and SwiGLU are statistically tied in Shazeer's own table, why is SwiGLU the universal choice in the open-weights families that defined the 2023–2026 era? The answer is convention propagation, not measured superiority.

GeGLU came first in the Google lineage: T5 v1.1 and LaMDA both adopted a GELU-gated FFN of exactly this form. When Meta released Llama, the design decision was to use SwiGLU rather than GeGLU; Llama-2 and Llama-3 kept SwiGLU; the open-weights ecosystem that grew on top of Llama β€” Qwen, DeepSeek, Mistral derivatives, and most recent decoder-only families described as Llama-style β€” inherited that choice [src_002, src_047]. GeGLU did not disappear: it is the gating choice of ModernBERT (a 2024 RoPE-equipped encoder-only revival, surveyed in Chapter 7) and of several T5 derivatives that pre-date the Llama lineage [src_002, src_047]. The two coexist because no benchmark has ever produced a clear separation between them.

The pedagogically honest statement is therefore: SwiGLU is standard because Llama is the de facto reference architecture for open-weights decoders, not because anyone has demonstrated that Swish gates beat GELU gates on a fair held-out test [src_019, src_002, src_047].

πŸ”— Connection

The open-weights decoder families named in this section β€” Llama, Qwen, Gemma, DeepSeek, Mistral derivatives β€” are surveyed in Chapter 8 where the ecosystem-coordination claim is examined in detail.

7. Parameter parity and the 8/3 multiplier

A SwiGLU FFN has three weight matrices instead of two, so naively replacing \(\mathrm{FFN}_{\text{ReLU}}\) with \(\mathrm{FFN}_{\text{SwiGLU}}\) at the same hidden width inflates the FFN parameter count by 50%. Shazeer's solution was to rescale the hidden width \(d_{ff}\) to keep the total constant. With the conventional T5 setup \(d_{ff} = 4D\), the two-matrix ReLU FFN has \(W_1 \in \mathbb{R}^{D \times d_{ff}}\) and \(W_2 \in \mathbb{R}^{d_{ff} \times D}\), for \(2 \cdot D \cdot d_{ff} = 8 D^{2}\) parameters. The three-matrix SwiGLU FFN has \(3 \cdot D \cdot d_{ff}\) parameters; equating the two gives

\[ 3 \cdot D \cdot d_{ff}^{\text{SwiGLU}} \;=\; 8 D^{2} \quad\Longrightarrow\quad d_{ff}^{\text{SwiGLU}} \;=\; \frac{8}{3} D \;\approx\; 2.67\, D. \]

This is the origin of the \(\tfrac{8}{3}D\) FFN width that Llama-family configurations advertise. Shazeer's own paper described the same operation as multiplying the original \(d_{ff}\) by \(\tfrac{2}{3}\): in the T5-base setup he started from \(d_{ff} = 3072\) and reduced to \(d_{ff} = 2048\) to match parameter and computation counts against the two-matrix baseline [src_019]. CS336's normalization-and-MLP lecture treats this rescaling as the canonical SwiGLU recipe and is the textbook reference modern implementations follow [src_004]. In production code, the resulting hidden width is often rounded to the nearest multiple of 64 or 128 to align with tensor-core tile sizes (GPU matrix-multiply units that operate on small fixed-size tiles, typically 16Γ—16 or 16Γ—32), which is why Llama checkpoints sometimes report a hidden width that is close to but not exactly \(\tfrac{8}{3}D\).
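The arithmetic, including the hardware-friendly rounding, fits in a few lines. The helper below is a sketch: the name is ours, and the multiple-of-256 rounding matches common Llama-style configurations rather than a rule from the sources.

def swiglu_hidden_dim(d_model: int, multiple_of: int = 256) -> int:
    # 8/3 * D, rounded up to the nearest hardware-friendly multiple.
    h = int(8 * d_model / 3)
    return multiple_of * ((h + multiple_of - 1) // multiple_of)

D = 4096
h = swiglu_hidden_dim(D)                  # 11008, the Llama-2 7B FFN width
relu_params = 2 * D * (4 * D)             # two-matrix FFN at width 4D
swiglu_params = 3 * D * h                 # three-matrix FFN at the rescaled width
print(h, swiglu_params / relu_params)     # 11008, ~1.008: parity up to the rounding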

The conceptual point is that, under parameter parity, the SwiGLU FFN trades a wider single-matrix bottleneck for a narrower bottleneck that is gated by an extra projection. The empirical claim that this trade is worthwhile is the Shazeer Table 1 result; there is no separate analytical argument [src_019].

πŸ’‘ Key result

Both swaps in this chapter β€” RMSNorm and SwiGLU β€” are empirical wins without separate analytical justifications; the modern block is what the open-weights ecosystem coordinated around, not what theory forced.

8. PyTorch reference implementation

The two ingredients of this chapter, RMSNorm and SwiGLU, are short enough that a faithful implementation fits on one page. The listing below follows the style of Raschka's from-scratch implementation reference and the canonical CS336 Lecture 3 implementation [src_010, src_004].

import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root mean square layer normalization (Zhang & Sennrich, 2019).

    Drops the mean term and the additive bias of LayerNorm; keeps a learned
    per-feature gain. The fast form uses rsqrt to avoid an explicit divide.
    """

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D). Compute mean square over the last dim.
        ms = x.pow(2).mean(dim=-1, keepdim=True)
        x_hat = x * torch.rsqrt(ms + self.eps)
        return x_hat * self.gamma


class SwiGLU(nn.Module):
    """SwiGLU feed-forward sublayer (Shazeer, 2020).

    Three weight matrices: w_gate and w_value share the input dim D and
    project to hidden dim d_ff; w_out projects back to D. The hidden dim
    is conventionally 8/3 * D in Llama-family configs to keep parameter
    parity with a 4*D-wide ReLU/GELU FFN.
    """

    def __init__(self, dim: int, hidden_dim: int | None = None):
        super().__init__()
        if hidden_dim is None:
            hidden_dim = int(round(8 * dim / 3))
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_value = nn.Linear(dim, hidden_dim, bias=False)
        self.w_out = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_gate(x)) * self.w_value(x))

A few notes on the listing. torch.rsqrt(ms + eps) is the form fast attention kernels and modern RMSNorm CUDA kernels actually compute: a single reciprocal-square-root instruction is cheaper than a square root followed by a divide, and the resulting tensor is multiplied (not divided) into \(x\) [src_004]. F.silu is PyTorch's name for \(\mathrm{SiLU}\), which is the same function as \(\mathrm{Swish}_{1}\); the "Swi" in SwiGLU therefore refers to this exact gate [src_019]. The three projections are bias-free because the modern decoder-only configurations in the Llama lineage drop bias parameters from the FFN throughout, mirroring the bias-free T5 convention Shazeer used [src_019, src_010].
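A short usage sketch, continuing from the listing above (toy shapes, not from the sources), wires the two modules in the pre-norm pattern of Β§1:

B, T, D = 2, 5, 512
x = torch.randn(B, T, D)
norm = RMSNorm(D)
ffn = SwiGLU(D)                    # hidden_dim defaults to round(8 * 512 / 3) = 1365
y = x + ffn(norm(x))               # pre-norm residual update: y = x + FFN(Norm(x))
print(y.shape)                     # torch.Size([2, 5, 512])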

9. The consolidated modern block

Combining this chapter's two substitutions with Chapter 2's RoPE substitution and the GQA substitution that Chapter 4 will derive in detail, a single decoder block in a modern open-weights LLM reads:

input  -> RMSNorm -> Attention(Q, K with RoPE; K, V grouped per GQA) -> + residual
       -> RMSNorm -> SwiGLU FFN                                       -> + residual
output

The attention call shares queries across multiple groups of key/value heads (GQA, derived in Chapter 4), the rotary embedding is applied to Q and K only (not to V) per Chapter 2, and the two RMSNorms operate on the residual stream before each sublayer per the pre-norm convention from Chapter 1 [src_010, src_002, src_047]. The same skeleton appears in Llama, Qwen, Gemma, and DeepSeek decoder configurations, with differences only in dimensions, head counts, and group counts [src_002, src_047].

Figure: the modern decoder-only block as deployed in the Llama / Qwen / Gemma / DeepSeek family.

The figure makes the residual topology explicit: every sublayer receives the residual stream as input, normalizes it through RMSNorm before transforming it, and adds the transformed result back into the stream. RoPE is internal to the attention sublayer's \(Q\) and \(K\) pathways. SwiGLU's three projections are visible as the three matrices inside the FFN sublayer.
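Translated into code, and reusing the Β§8 modules, the skeleton is a handful of lines. The sketch below treats the attention sublayer as a given module (RoPE and GQA live inside it and are covered in Chapters 2 and 4); the class and argument names are illustrative, not taken from any particular codebase.

import torch
import torch.nn as nn


class DecoderBlock(nn.Module):
    """Consolidated pre-norm block: RMSNorm -> attention -> residual, RMSNorm -> SwiGLU -> residual."""

    def __init__(self, dim: int, attention: nn.Module, hidden_dim: int | None = None):
        super().__init__()
        self.attn_norm = RMSNorm(dim)       # from the Β§8 listing
        self.attn = attention               # e.g. GQA attention with RoPE applied to Q and K
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden_dim)  # from the Β§8 listing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.attn_norm(x))   # attention sublayer, residual added unnormalized
        x = x + self.ffn(self.ffn_norm(x))     # FFN sublayer, residual added unnormalized
        return x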

This concludes the modernization tour of the components inside one block. Chapter 4 turns the camera around and asks what changes when these blocks are stacked deep enough that attention itself becomes the dominant inference-time cost β€” at which point GQA, FlashAttention, and KV-cache engineering become unavoidable.

πŸ”— Connection

Three forward references in this section all point at the same chapter: GQA (the grouped-key/value attention shown in the consolidated block), FlashAttention (the IO-aware exact-attention kernel), and KV-cache (the autoregressive decoding cache). All three are derived in Chapter 4.

πŸ”„ Recap

  • Complete: RMSNorm differs from LayerNorm by dropping ___ and ___, retaining only ___ per feature; this leaves it with the ___ invariance and discards the ___ invariance.
  • Compare: name the three weight matrices of a SwiGLU FFN and their roles, and contrast with the two-matrix ReLU FFN.
  • Derive: starting from \(2 \cdot D \cdot d_{ff} = 8 D^{2}\) and the parameter-parity constraint with three matrices, recover the \(\tfrac{8}{3}D\) multiplier without re-reading Β§7.
  • Predict: which two prerequisite chapters does Β§9's consolidated block depend on, and which forward chapter introduces GQA, FlashAttention, and KV-cache?

References

  • [src_002] Xiao, T. and Zhu, J. (2025). Foundations of Large Language Models. arXiv:2501.09223v2. URL: https://arxiv.org/pdf/2501.09223
  • [src_004] Hashimoto, T. and Liang, P. (2025). Stanford CS336: Language Modeling from Scratch (Spring 2025). URL: https://stanford-cs336.github.io/spring2025/
  • [src_010] Raschka, S. (2024). Build a Large Language Model (From Scratch). Manning Publications. URL: https://github.com/rasbt/LLMs-from-scratch
  • [src_018] Zhang, B. and Sennrich, R. (2019). Root Mean Square Layer Normalization. arXiv:1910.07467. URL: https://arxiv.org/pdf/1910.07467
  • [src_019] Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv:2002.05202. URL: https://arxiv.org/pdf/2002.05202
  • [src_047] Grigorov, D. (2026). Building Large Language Models from Scratch. Apress. URL: https://doi.org/10.1007/979-8-8688-2297-1