The Transformer Block Revisited

1. Why revisit the Transformer

The architecture introduced by Vaswani et al. in 2017 has become the de facto substrate for sequence modelling, replacing recurrent and convolutional alternatives in machine translation, language modelling, and most downstream natural-language tasks [src_012]. The same template later expanded to vision, audio, code, and protein modelling, and the encoder-only and decoder-only families that came to dominate language modelling are best read as restrictions of the original encoder-decoder design [src_012]. A book that surveys modern deep learning therefore needs a clean reference point against which everything else in this volume can be compared.

That reference point is what Chapter 1 builds. Every subsequent chapter is most economically described as a delta from the 2017 base block: the rotary position encoding of Chapter 2 replaces the sinusoidal/learned absolute schemes, the RMSNorm and SwiGLU choices of Chapter 3 replace LayerNorm and ReLU+linear, and the FlashAttention, GQA, and MQA mechanisms of Chapter 4 change how attention is computed without changing what it computes. The 2017 architecture is no longer used unmodified at scale; modern large language models consistently swap the position encoding, the normalisation, the activation, and often the attention pattern, while leaving the residual-stream skeleton intact [src_002]. Establishing that skeleton precisely is the work of the present chapter.

The text below assumes the reader has met the building blocks of deep learning before β€” gradient descent, backpropagation, batch normalisation, residual connections, the cross-entropy loss β€” at the level of a graduate machine-learning course or a canonical foundations reference such as Bishop and Bishop [src_001]. Readers who want a runnable from-scratch implementation of every component named below should consult Raschka's notebooks alongside this chapter [src_010].

2. Notation and setup

🎯 Intuition

Picture the activation tensor \((B, T, D)\) as the through-line of every Transformer forward pass: \(B\) sequences in a batch, each of \(T\) tokens, each token a \(D\)-dimensional vector. Every block in the stack reads a tensor of this shape and emits a tensor of the same shape β€” the residual connections force this. The per-block sub-layers (attention, FFN, normalisation) are each shape-preserving maps that operate along the \(D\)-axis, so reading the chapter as a whole reduces to: "what does each sub-layer do to the \(D\)-axis?".

We commit to one notation table for the whole book. Let \(B\) denote the batch size, \(T\) the sequence length, \(D\) the model (residual-stream) dimension, \(h\) the number of attention heads, and \(d_h = D/h\) the per-head dimension; let \(d_{ff}\) be the feed-forward hidden width, related to \(D\) by an expansion ratio \(r\) such that \(d_{ff} = r \cdot D\). The model has \(L\) stacked blocks. A token's hidden state at layer \(\ell\) is written \(x^{(\ell)} \in \mathbb{R}^{D}\), and the full activation tensor passing through the network has shape \((B, T, D)\).

This is the tensor-style notation that aligns most cleanly with modern Transformer implementations and with Chapters 2-4 of this book. The 2025 monograph by Xiao and Zhu uses an equivalent set of symbols β€” \(m\) for sequence length, \(d\) for hidden size, \(n_{\text{head}}\) for number of heads, \(d_{ffn}\) for FFN hidden size, and \(L\) for depth [src_002] β€” and we translate between conventions whenever a Xiao-Zhu quotation appears.

For the worked example that runs through this chapter we use Vaswani's base configuration: \(L = 6\), \(D = 512\), \(h = 8\), \(d_h = 64\), \(d_{ff} = 2048\), so the expansion ratio is \(r = 4\) [src_012]. The "big" variant of the 2017 paper has \(D = 1024\), \(h = 16\), \(d_{ff} = 4096\), again with \(L = 6\) [src_012].
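
For readers who like to keep the running example executable, the following minimal configuration sketch (in Python, with names of our own choosing rather than from any published codebase) records the base and big 2017 settings together with the derived quantities \(d_h\) and \(r\):

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    """Hyperparameters of the 2017 Transformer in this book's notation (illustrative only)."""
    L: int = 6        # number of stacked blocks
    D: int = 512      # model (residual-stream) dimension
    h: int = 8        # number of attention heads
    d_ff: int = 2048  # feed-forward hidden width

    @property
    def d_h(self) -> int:   # per-head dimension D / h
        return self.D // self.h

    @property
    def r(self) -> float:   # FFN expansion ratio d_ff / D
        return self.d_ff / self.D

base = TransformerConfig()                        # D=512, h=8  ->  d_h=64, r=4
big = TransformerConfig(D=1024, h=16, d_ff=4096)  # the "big" 2017 variant
assert base.d_h == 64 and base.r == 4.0 and big.d_h == 64
```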

3. Scaled dot-product self-attention

A self-attention layer takes a sequence of \(T\) tokens, each represented as a \(D\)-dimensional vector, and produces a new sequence of \(T\) vectors of the same dimension, where every output vector is a weighted average of value projections of every input vector. The mechanism Vaswani et al. introduced β€” and the only attention mechanism this book treats as primitive β€” is scaled dot-product attention [src_012].

Concretely, let \(X \in \mathbb{R}^{T \times D}\) be the input sequence (we drop the batch dimension throughout this section to lighten notation). Three learned linear projections produce queries, keys, and values: \[Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V,\] with \(W_Q, W_K \in \mathbb{R}^{D \times d_k}\) and \(W_V \in \mathbb{R}^{D \times d_v}\). In standard practice \(d_k = d_v\), and we write \(d_k\) for both. Scaled dot-product attention then computes \[\text{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,\] where the softmax is applied row-wise [src_012].
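
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention (function and variable names are ours; the batch dimension is dropped, as in the text):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (T, d_k); V: (T, d_v). Returns a (T, d_v) array of weighted value averages."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (T, T) pre-softmax logits
    weights = softmax(scores, axis=-1)        # row-wise softmax
    return weights @ V

# Shapes for the base configuration: T = 10 tokens, D = 512, d_k = d_v = 64.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 512))
W_Q, W_K, W_V = (rng.standard_normal((512, 64)) * 0.02 for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
assert out.shape == (10, 64)
```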

3.1 Why the \(1/\sqrt{d_k}\) scale

The scale factor is not cosmetic. Throughout this section we use informative regime to mean the regime in which no single softmax entry dominates and the softmax's gradient with respect to its inputs is non-vanishing β€” the opposite of the saturated regime that this scaling argument seeks to avoid. Suppose the components of any query vector \(q\) and key vector \(k\) are independent random variables with mean zero and variance one. Then their dot product \(q \cdot k = \sum_{i=1}^{d_k} q_i k_i\) is a sum of \(d_k\) independent zero-mean unit-variance products, so \(\mathrm{E}[q \cdot k] = 0\) and \(\mathrm{Var}[q \cdot k] = d_k\) [src_012]. Without rescaling, the typical magnitude of the pre-softmax logit therefore grows like \(\sqrt{d_k}\), and as \(d_k\) increases the softmax saturates: a few entries dominate, the others receive vanishing weight, and the gradient of the softmax with respect to its inputs collapses to near zero in the saturated entries.

πŸ€” Pause and reflect

Before reading on: for \(d_k = 1\) versus \(d_k = 1024\), how much does the typical magnitude of \(q \cdot k\) change, and what does that imply for the softmax row that consumes these logits? (Do not look ahead β€” write the answer down or say it out loud.)

Concretely, at \(d_k = 64\) the typical logit has magnitude \(\sqrt{64} = 8\), so a gap of \(8\) between two logits in the same row is unremarkable β€” and passing such a gap through a row-wise softmax yields \(e^{8}/(e^{8} + e^{0}) \approx 0.9997\), an already saturated weight on the larger entry. At \(d_k = 8\) the typical magnitude is \(\sqrt{8} \approx 2.8\) and the corresponding softmax is far gentler. Dividing the logits by \(\sqrt{d_k}\) restores unit variance and keeps the softmax in its informative regime [src_012]. This argument is a one-line consequence of the central-limit intuition for inner products, and it explains why ablations comparing dot-product attention with and without the scale find a large gap at high \(d_k\) and a small one at low \(d_k\) [src_012].
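
The variance argument is easy to check numerically. The sketch below (ours, not from the paper) samples unit-variance queries and keys and compares the empirical logit variance with and without the \(1/\sqrt{d_k}\) factor:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (8, 64, 1024):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    logits = (q * k).sum(axis=1)              # 10k samples of q · k
    print(f"d_k={d_k:5d}  Var[q·k] ≈ {logits.var():8.1f}  "
          f"Var[q·k / sqrt(d_k)] ≈ {(logits / np.sqrt(d_k)).var():.3f}")
# The unscaled variance grows like d_k; the scaled variance stays near 1,
# which keeps the softmax that consumes these logits in its informative regime.
```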

πŸ’‘ Key result

Scaling the pre-softmax logits by \(1/\sqrt{d_k}\) keeps the softmax in its informative regime regardless of head dimension.

πŸ”„ Recap

  • Complete the equation: if \(q\) and \(k\) have i.i.d. unit-variance entries, then \(\mathrm{Var}[q \cdot k] = \;?\)
  • Explain why a row of pre-softmax logits with typical magnitude \(\sqrt{d_k}\) produces a peaked softmax for large \(d_k\).
  • Predict: if \(d_k\) doubles from \(64\) to \(128\), by what factor does the typical pre-softmax logit grow? What scale factor undoes the change?

3.2 Causal masking

Self-attention as written above is permutation-invariant in the keys and values: every output position can attend to every input position, and nothing in the formula distinguishes earlier positions from later ones. For autoregressive decoders this is wrong, because the prediction at position \(i\) must not depend on tokens at positions \(j > i\). The 2017 paper implements the constraint inside the softmax: before the row-wise softmax is applied, the entries of \(QK^{\top}/\sqrt{d_k}\) corresponding to disallowed connections are set to \(-\infty\), which sends their post-softmax weights to zero [src_012]. We will refer to this throughout the book as causal self-attention. Xiao and Zhu describe the same mechanism and explicitly note that masking is what enables a single Transformer block to be used both bidirectionally (as in BERT β€” predicting all positions at once given the full surrounding context) and autoregressively (as in GPT β€” predicting one position at a time conditioned on the prefix), with the difference reduced to a binary mask [src_002].
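
A minimal sketch of the masked variant (ours, reusing the shapes from Section 3) sets the disallowed entries to \(-\infty\) before the row-wise softmax:

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Masked scaled dot-product attention: position i attends only to positions j <= i."""
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                      # (T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)     # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)             # disallowed connections -> -inf
    scores -= scores.max(axis=-1, keepdims=True)         # numerically stable softmax
    weights = np.exp(scores)                             # exp(-inf) = 0: masked weights vanish
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 5, 64))
out = causal_self_attention(Q, K, V)
assert out.shape == (5, 64)   # the first output row depends on position 0 alone
```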

πŸ”— Connection

The encoder-only / decoder-only / encoder-decoder split that this binary-mask reading anticipates is taken up in Chapter 7, where each family is examined as a restriction of the 2017 encoder-decoder skeleton.

4. Multi-head attention

A single attention head with full dimensionality \(D\) has only one shared subspace in which to compare queries and keys. Vaswani et al. argue that this is restrictive, and propose to replace the single head with \(h\) parallel heads operating on lower-dimensional projections [src_012]. With \(h\) heads of per-head dimension \(d_h = D/h\), the layer computes \[\text{MultiHead}(Q, K, V) = \mathrm{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W_O,\] where each \(\text{head}_i = \text{Attention}(Q W_Q^{(i)}, K W_K^{(i)}, V W_V^{(i)})\) uses its own learned projections \(W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{D \times d_h}\) and \(W_V^{(i)} \in \mathbb{R}^{D \times d_h}\), and \(W_O \in \mathbb{R}^{D \times D}\) is a final output projection [src_012]. Because each head costs \(\mathcal{O}(T^2 d_h) = \mathcal{O}(T^2 D / h)\) and there are \(h\) heads, the total compute of multi-head attention is comparable to single-head attention with full dimensionality \(D\) [src_012].

The qualitative gain Vaswani et al. attribute to multiple heads is the ability to attend to information from different representation subspaces at different positions; with one head the averaging that happens inside the softmax inhibits this kind of multi-channel routing [src_012]. The base 2017 model uses \(h = 8\), \(d_h = 64\), and so \(D = h \cdot d_h = 512\); the big variant uses \(h = 16\) and \(D = 1024\) with the same \(d_h\) [src_012].
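
The projection layout above translates into a short sketch (ours; as is standard, the \(h\) per-head \(D \times d_h\) projections are stored as one \(D \times D\) matrix each for \(W_Q\), \(W_K\), \(W_V\)):

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """X: (T, D). W_Q/W_K/W_V: (D, D), each holding h concatenated (D, d_h) head projections.
    W_O: (D, D) output projection. Returns (T, D)."""
    T, D = X.shape
    d_h = D // h
    # Project once, then split the last axis into h heads of width d_h.
    Q = (X @ W_Q).reshape(T, h, d_h).transpose(1, 0, 2)   # (h, T, d_h)
    K = (X @ W_K).reshape(T, h, d_h).transpose(1, 0, 2)
    V = (X @ W_V).reshape(T, h, d_h).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)      # (h, T, T)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                    # (h, T, d_h)
    concat = heads.transpose(1, 0, 2).reshape(T, D)        # Concat(head_1, ..., head_h)
    return concat @ W_O

rng = np.random.default_rng(0)
D, h, T = 512, 8, 10
W_Q, W_K, W_V, W_O = (rng.standard_normal((D, D)) * 0.02 for _ in range(4))
assert multi_head_attention(rng.standard_normal((T, D)), W_Q, W_K, W_V, W_O, h).shape == (T, D)
```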

5. Position-wise feed-forward network

Every Transformer block contains a two-layer position-wise feed-forward network, applied independently and identically to each token position: \[\text{FFN}(x) = \max(0, x W_1 + b_1)\, W_2 + b_2,\] with \(W_1 \in \mathbb{R}^{D \times d_{ff}}\), \(b_1 \in \mathbb{R}^{d_{ff}}\), \(W_2 \in \mathbb{R}^{d_{ff} \times D}\), \(b_2 \in \mathbb{R}^{D}\), and a pointwise ReLU between the two linear maps [src_012]. The layer is "position-wise" in the sense that the same parameters \(W_1, b_1, W_2, b_2\) are shared across positions within a block, but each block has its own copy [src_012]. Equivalently, the FFN is a pair of \(1 \times 1\) convolutions with a ReLU between them [src_012].
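
In code the sub-layer is two matrix multiplications and a pointwise maximum; the sketch below (ours) makes the shape-preserving property explicit:

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """X: (T, D); W1: (D, d_ff); W2: (d_ff, D). The same weights are applied at every position."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2   # ReLU between the two linear maps

rng = np.random.default_rng(0)
D, d_ff, T = 512, 2048, 10
out = position_wise_ffn(rng.standard_normal((T, D)),
                        rng.standard_normal((D, d_ff)) * 0.02, np.zeros(d_ff),
                        rng.standard_normal((d_ff, D)) * 0.02, np.zeros(D))
assert out.shape == (T, D)   # shape-preserving along the residual stream
```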

The base 2017 model sets \(D = 512\) and \(d_{ff} = 2048\), which is the canonical \(4\times\) expansion ratio: \(r = d_{ff} / D = 4\) [src_012]. Xiao and Zhu describe the same convention as the typical setting across BERT and pre-LLM Transformers, \(d_{ffn} \approx 4d\) in their notation [src_002].

The intuition for the FFN's role is that self-attention alone is a linear-in-values operation β€” every output of an attention layer is a convex combination of value projections of inputs β€” so a stack of attention-only layers would saturate in expressivity once one chained enough of them. The FFN's pointwise non-linearity is what prevents this representation degeneration [src_002]. In Xiao and Zhu's accounting the FFN also carries the bulk of the parameter budget: with \(r = 4\), the two FFN matrices alone account for \(2 \cdot 4 \cdot D^2 = 8 D^2\) parameters per block, against \(4 D^2\) for the four attention projections combined, so two-thirds of the per-block weights live in the FFN [src_002].

⚠️ Pitfall

"Linear in \(V\)" is a statement about the values, not about the attention layer overall β€” the softmax over \(QK^\top\) is the only non-linearity inside the attention sub-block, and it acts on the weights (which mix value vectors) rather than on the values themselves. Stacking attention-only blocks therefore composes a sequence of weighted-mixing operations whose expressivity does indeed saturate, but the precise statement is more delicate than "linear-in-values" suggests.

The choice \(r = 4\) is conventional rather than principled. Vaswani et al. state the value without justifying the ratio; Xiao and Zhu state \(r \approx 4\) as the typical setting without offering an ablation [src_012, src_002]. Modern decoder-only LLMs with gated FFNs (SwiGLU and its siblings) reduce the ratio to roughly \(8/3 \approx 2.67\) to keep the total FFN parameter count comparable to the ungated \(4\times\) baseline, since gated FFNs use three matrices instead of two; the Llama 2 technical report documents this rebalancing explicitly as the convention adopted by the modern decoder-only family [src_029]. Chapter 3 develops the gated-FFN derivation and parameter accounting in detail.
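
The rebalancing itself is a one-line parameter count; the arithmetic below (ours, at an illustrative model dimension) checks that a three-matrix FFN at \(r \approx 8/3\) lands on roughly the same budget as the two-matrix \(4\times\) baseline:

```python
D = 4096                          # illustrative model dimension, not from any specific model
ungated = 2 * (4 * D) * D         # two matrices with d_ff = 4D        ->  8 * D**2
gated = 3 * ((8 * D) // 3) * D    # three matrices with d_ff ~ (8/3)D  ->  ~8 * D**2
print(ungated, gated)             # the budgets agree up to rounding of d_ff
```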

πŸ”— Connection

The gated-FFN derivation (SwiGLU, GeGLU) and the \(r \approx 8/3\) rebalancing for the ungated \(4\times\) baseline are the subject of Chapter 3, which also covers how the third matrix changes the parameter accounting at constant FFN-budget.

6. Residual connections, LayerNorm, and the residual stream

Each sub-layer in the 2017 Transformer β€” attention or FFN β€” is wrapped in a residual connection followed by a layer normalisation. In the original formulation the output of a sub-layer is \[\text{LayerNorm}(x + \text{Sublayer}(x)),\] and to make the residual sum dimensionally consistent every sub-layer, including the input embedding, produces vectors of dimension \(D = 512\) in the base model [src_012]. This places the normalisation outside the residual path; we will return to this design choice in Section 7.

🎯 Intuition

LayerNorm re-centers a hidden vector \(h \in \mathbb{R}^D\) to zero mean across its \(D\) features, then re-scales it to unit variance, then re-applies a learnable per-feature gain and bias. Geometrically: subtract the mean and divide by the standard deviation, which pins the vector to a sphere of fixed radius in feature space, then stretch and shift each coordinate back with parameters the network learns.

LayerNorm itself, in the form used in the Transformer, computes the mean and variance of a hidden vector \(h \in \mathbb{R}^D\) along its feature axis, \[\mu = \frac{1}{D}\sum_{i=1}^{D} h_i, \qquad \sigma^2 = \frac{1}{D}\sum_{i=1}^{D} (h_i - \mu)^2,\] and then rescales and shifts the standardised vector with learnable per-feature gain \(\alpha \in \mathbb{R}^D\) and bias \(\beta \in \mathbb{R}^D\): \[\text{LayerNorm}(h) = \alpha \odot \frac{h - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,\] where \(\odot\) is the elementwise product and \(\epsilon\) is a small numerical-stability constant [src_002]. Unlike batch normalisation, LayerNorm is computed per token independently of the batch and is therefore as well-defined at inference time on a single example as it is at training time on a large batch [src_002].
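
A minimal sketch of this formula (ours, using the biased \(1/D\) variance written above) is:

```python
import numpy as np

def layer_norm(h, alpha, beta, eps=1e-5):
    """h: (..., D). Normalise along the feature axis, then apply per-feature gain and bias."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)   # divides by D, matching the formula above
    return alpha * (h - mu) / np.sqrt(var + eps) + beta

D = 512
x = np.random.default_rng(0).standard_normal((10, D)) * 3.0 + 1.0
y = layer_norm(x, alpha=np.ones(D), beta=np.zeros(D))
# Each token is normalised independently of every other token in the batch or sequence.
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)
```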

6.1 The residual stream as a viewpoint

The residual connection plus LayerNorm is what makes deep stacks of Transformer blocks trainable in practice: without the residual the gradient signal at depth \(L\) would have to propagate through \(L\) matrix multiplications and \(L\) softmaxes, while with the residual the identity term carries gradient back unattenuated [src_001, src_012]. A useful conceptual frame, due to mechanistic-interpretability work that came later, is to think of the activations passing along the residual path as a residual stream: a \(D\)-dimensional running representation per token, into which each block additively writes a small edit and from which it reads via attention and FFN projections. The base block thus computes \[x \mapsto x + \text{Attn}(x), \qquad x \mapsto x + \text{FFN}(x),\] modulo the LayerNorm placement. The residual-stream picture is implicit already in Vaswani's residual-plus-LayerNorm formulation and is made explicit in Xiao and Zhu's discussion of why the residual is what enables depth [src_002].

⚠️ Pitfall

The residual-stream reading is cleanest under the pre-norm placement Section 7 will introduce; under the post-norm placement of Vaswani's original formulation, the LayerNorm sits across the residual sum and partly obscures the "identity-plus-edits" semantics. The picture works either way as a heuristic but its theoretical cleanness depends on the placement choice.

πŸ”— Connection

Chapter 3 takes up RMSNorm as a drop-in replacement for LayerNorm; the residual-stream picture established here is the prerequisite framing under which the RMSNorm placement choice (and its compatibility with pre-norm) becomes legible.

7. Pre-norm versus post-norm

The 2017 paper's \(\text{LayerNorm}(x + \text{Sublayer}(x))\) is now called the post-norm formulation, because the normalisation is applied after the residual sum [src_012, src_002]. A widely used alternative, the pre-norm formulation, applies the normalisation inside the residual: \[x \mapsto x + \text{Sublayer}(\text{LayerNorm}(x)).\] Xiao and Zhu present both and note that the post-norm design is the original one used by Vaswani et al. and adopted by the BERT-style encoder-only family, while the pre-norm design has become the dominant choice for deep modern Transformers [src_002].
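
Written as code, the two placements differ only in where the normalisation sits relative to the residual sum (a schematic sketch, with `sublayer` and `norm` standing in for components such as those defined earlier):

```python
def post_norm_step(x, sublayer, norm):
    """Original 2017 placement: normalise after the residual sum."""
    return norm(x + sublayer(x))

def pre_norm_step(x, sublayer, norm):
    """Modern placement: normalise inside the residual; the identity path is left untouched."""
    return x + sublayer(norm(x))
```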

πŸ€” Pause and reflect

Compare \(\text{LayerNorm}(x + \text{Sublayer}(x))\) and \(x + \text{Sublayer}(\text{LayerNorm}(x))\). In which formulation does the gradient flow through the identity term unattenuated, and what does that imply for training a deep stack at initialisation? (Predict before reading the next paragraph.)

The reason most modern large language models use pre-norm is training stability. The mean-field analysis of Xiong et al. (2020) tracks the expected magnitude of gradients at initialisation rather than their exact values. It shows that, at initialisation, the post-norm Transformer places disproportionately large expected gradients on parameters near the output layer, which is why a carefully tuned warmup schedule is necessary β€” a schedule that ramps the learning rate from near zero up to its peak over the first few thousand steps, then decays it. The same analysis shows that pre-norm gradients are well-behaved at initialisation, and that pre-norm Transformers trained without warmup match post-norm-plus-warmup baselines while requiring less training time and hyperparameter tuning [src_049]. In residual-stream terms, the pre-norm residual path is unnormalised: gradients flow straight through the identity term, the magnitude of activations along the residual path is controlled by the cumulative effect of the additive sub-layer outputs rather than by repeated normalisations, and warmup-free training therefore becomes practical at depth [src_049, src_002]. Xiao and Zhu accordingly recommend pre-norm for deep Transformers and document that the choice has become near-universal in modern LLMs [src_002].

πŸ’‘ Key result

Pre-norm is the modern default; the residual-stream picture is what makes that default natural.

πŸ”„ Recap

  • Explain in your own words why post-norm Transformers required a tuned warmup schedule at initialisation.
  • Compare: in which formulation (post-norm or pre-norm) does the residual path remain unnormalised, and what does that mean for the residual-stream reading?
  • Predict: if you swap a post-norm encoder for a pre-norm encoder of the same depth, which hyperparameter becomes less sensitive?

πŸ”— Connection

The pre-norm placement established here is the prerequisite for Chapter 3's treatment of RMSNorm β€” RMSNorm is a parameter-cheaper LayerNorm variant that only makes sense as a pre-norm component, not as a post-norm one.

8. Positional encoding: sinusoidal and learned

Self-attention is permutation-equivariant: shuffling the input tokens shuffles the output tokens identically, so a Transformer with no auxiliary positional information cannot distinguish "the cat sat on the mat" from "the mat sat on the cat" [src_012]. Vaswani et al. inject position information by adding a positional encoding \(PE \in \mathbb{R}^{T \times D}\) to the input token embeddings before the first block.

🎯 Intuition

The picture is a coordinate system, not a learned signal. Each position is encoded as a fixed vector of sines and cosines whose wavelengths span a wide geometric range β€” from a few tokens at one extreme to roughly \(60{,}000\) tokens at the other. Nearby positions get nearly identical encodings (similar phases at every wavelength); distant positions get nearly orthogonal ones. The model can then consult this coordinate system through ordinary additive lookups, without having to learn any position-specific parameter itself.

The 2017 paper uses fixed sinusoids of geometrically spaced wavelengths: \[PE(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/D}}\right), \qquad PE(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/D}}\right),\] where \(\text{pos}\) is the absolute position and \(i\) indexes the pair of embedding dimensions [src_012]. The wavelengths form a geometric progression from \(2\pi\) to \(10000 \cdot 2\pi\), chosen because the authors hypothesised that the model could learn to attend by relative offsets β€” for any fixed shift \(k\), \(PE(\text{pos}+k)\) is a linear function of \(PE(\text{pos})\), so a head that wants to implement a relative offset has a simple target [src_012].
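
A short sketch of the table (ours) builds the full \((T, D)\) encoding at once:

```python
import numpy as np

def sinusoidal_positional_encoding(T, D):
    """Return the (T, D) table of fixed sinusoids defined above (D assumed even)."""
    pos = np.arange(T)[:, None]             # (T, 1) absolute positions
    i = np.arange(D // 2)[None, :]          # (1, D/2) dimension-pair index
    angles = pos / (10000 ** (2 * i / D))   # (T, D/2)
    PE = np.empty((T, D))
    PE[:, 0::2] = np.sin(angles)            # even dimensions
    PE[:, 1::2] = np.cos(angles)            # odd dimensions
    return PE

PE = sinusoidal_positional_encoding(T=128, D=512)
# Nearby positions have similar rows; distant positions are close to orthogonal.
assert PE.shape == (128, 512)
```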

The 2017 paper also experimented with learned absolute positional embeddings, in which \(PE\) is a trainable matrix of shape \(T_{\max} \times D\) rather than a fixed sinusoidal table. The ablation in Table 3 row (E) reports nearly identical translation quality between sinusoidal and learned positional embeddings on the WMT 2014 English-German development set, and the authors retained the sinusoidal version because they expected it would extrapolate better to sequence lengths beyond the training range [src_012]. The encoder-only family that followed BERT adopted learned absolute positional embeddings as standard practice instead [src_002].

In modern LLMs both schemes have been displaced. Sinusoidal embeddings did not extrapolate to longer-than-training contexts as cleanly as Vaswani et al. hoped; learned absolute embeddings need to be re-trained or extrapolated whenever the context window grows. Both are now routinely replaced by relative or rotary encodings, with rotary position encoding (RoPE) being the dominant choice in current decoder-only LLMs [src_002]. Chapter 2 takes RoPE up in detail; for the purposes of this chapter it suffices to record that absolute positional encodings β€” sinusoidal and learned β€” were the 2017 baseline, and that the baseline has changed.

⚠️ Pitfall

Vaswani et al. retained sinusoidal embeddings expecting better extrapolation; the empirical record says the opposite. Neither sinusoidal nor learned absolute embeddings extrapolate cleanly to longer contexts, which is why modern LLMs replaced both with relative or rotary schemes.

πŸ”— Connection

Chapter 2 takes up rotary position encoding (RoPE) in detail, including its relation to relative-position attention and its extrapolation behaviour beyond the training context length.

9. Putting it together: encoder block, decoder block, encoder-decoder

The encoder of the 2017 Transformer is a stack of \(L = 6\) identical layers. Each layer has two sub-layers in series: a multi-head self-attention sub-layer and a position-wise FFN sub-layer, each wrapped in a residual connection and a LayerNorm in the post-norm formulation [src_012]. All sub-layers and the input embedding produce vectors of dimension \(D = 512\), so the residual sums are well-defined throughout the stack [src_012].
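
Composing the pieces gives the encoder block in a few lines. The sketch below (ours) takes the sub-layers and normalisations as callables β€” such as the earlier sketches in this chapter β€” and uses the post-norm placement of the original paper:

```python
def encoder_block(x, attn, ffn, norm1, norm2):
    """One post-norm encoder block: two sub-layers, each wrapped in residual + LayerNorm.
    x: (T, D); attn, ffn, norm1, norm2: callables mapping (T, D) -> (T, D)."""
    x = norm1(x + attn(x))   # multi-head self-attention sub-layer
    x = norm2(x + ffn(x))    # position-wise FFN sub-layer
    return x                 # same shape in and out, so blocks stack freely
```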

The decoder is also a stack of \(L = 6\) identical layers, but each layer has three sub-layers rather than two. The first is a masked (causal) multi-head self-attention sub-layer, which attends only to positions up to and including the current one. The second is a multi-head encoder-decoder cross-attention sub-layer, in which the queries come from the previous decoder layer but the keys and values come from the output of the encoder stack β€” this is what lets every decoder position see the entire input sequence. The third is the position-wise FFN. As in the encoder, each sub-layer is wrapped in residual + LayerNorm [src_012].

The base 2017 model has \(L = 6\), \(D = 512\), \(h = 8\), \(d_k = d_v = 64\), and \(d_{ff} = 2048\); the big model has \(L = 6\), \(D = 1024\), \(h = 16\), and \(d_{ff} = 4096\) [src_012]. The full encoder-decoder architecture is the substrate from which encoder-only models (BERT-style β€” drop the decoder, drop the masking, train with masked language modelling) and decoder-only models (GPT-style β€” drop the encoder, keep the masking, train with next-token prediction) are obtained by restriction [src_002]. Chapter 7 takes up these three families in their own right; here we record only that they share the same block-level skeleton.

πŸ”— Connection

The encoder-only, decoder-only, and encoder-decoder families share the per-block skeleton documented above; Chapter 7 takes them up in their own right, including the specific masking and training-objective choices that distinguish each family.

9.1 Reference implementation

A reference Python implementation of every component named above β€” scaled dot-product attention, multi-head attention with the standard projection layout, the position-wise FFN, residual+LayerNorm wrapping, sinusoidal positional encoding, the full encoder block and decoder block, and the encoder-decoder loop β€” appears in Raschka's open notebooks alongside this book [src_010].

10. Worked example: the 2017 base model

To anchor the abstract notation in concrete numbers, we trace the parameter and compute budget of the base 2017 Transformer at the granularity of one block.

With \(D = 512\), \(h = 8\), \(d_h = 64\), the multi-head attention sub-layer holds four \(D \times D\) matrices β€” three for \(W_Q, W_K, W_V\) (each implemented as the concatenation of \(h\) per-head \(D \times d_h\) projections, totalling \(D \times D\)) and one for the output projection \(W_O\) β€” for a total of \(4 D^2 = 4 \cdot 512^2 \approx 1.05 \times 10^6\) parameters [src_012]. The FFN sub-layer holds two matrices of shapes \(D \times d_{ff}\) and \(d_{ff} \times D\) with \(d_{ff} = 2048\), so its parameter count is \(2 D d_{ff} = 2 \cdot 512 \cdot 2048 \approx 2.10 \times 10^6\) parameters [src_012]. The FFN therefore carries about twice the parameter budget of the attention sub-layer at \(r = 4\) β€” the two-thirds-in-the-FFN ratio Section 5 derived in symbolic form [src_002].
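
The counts are easy to re-derive mechanically (a check in plain Python; biases are omitted, as in the counts above):

```python
D, h, d_ff = 512, 8, 2048

attn_params = 4 * D * D           # W_Q, W_K, W_V (h concatenated heads each) and W_O
ffn_params = 2 * D * d_ff         # W_1 and W_2
print(attn_params, ffn_params)    # 1_048_576 and 2_097_152
print(ffn_params / (attn_params + ffn_params))   # 0.666... : two-thirds of per-block weights
```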

Stacked across \(L = 6\) encoder layers and \(L = 6\) decoder layers (with the decoder having an extra cross-attention sub-layer per layer), and including the embedding tables, the base model totals roughly \(65 \times 10^6\) parameters β€” order-of-magnitude consistent with the "base" line of Vaswani et al. Table 3 [src_012]. The compute budget per training step is dominated, at the modest sequence lengths used for translation, by the FFN matrix multiplications; at long sequence lengths the \(\mathcal{O}(T^2 D)\) self-attention term overtakes the \(\mathcal{O}(T D^2)\) FFN term, which is the empirical pressure that motivates the efficient-attention work of Chapter 4 [src_012].

πŸ”— Connection

The asymptotic crossover where attention's \(\mathcal{O}(T^2 D)\) overtakes the FFN's \(\mathcal{O}(T D^2)\) is exactly the regime Chapter 4 addresses, with FlashAttention, grouped-query and multi-query attention, and sliding-window patterns.

11. Teaser: what's next

This chapter establishes the 2017 reference architecture as the baseline against which every later chapter of this book is best read. Chapter 2 replaces sinusoidal and learned absolute positional encodings with rotary position encoding (RoPE), which injects relative-position information through a rotation in query-key space rather than an additive embedding [src_002]. Chapter 3 replaces LayerNorm with RMSNorm and ReLU+linear with SwiGLU, in both cases trading away parameters and computation that empirical evidence suggests are not load-bearing [src_002]. Chapter 4 takes up FlashAttention, grouped-query attention (GQA), multi-query attention (MQA), and sliding-window patterns β€” implementations and approximations that change the cost of self-attention without changing what it computes [src_002]. Together these three chapters describe the modern decoder-only Transformer; the reader who wants a complete map should treat the present chapter as the origin and Chapter 8 as the destination.

πŸ”„ Recap

  • Complete: every block in a Transformer stack ingests and emits a tensor of shape ____.
  • Explain why the FFN sub-layer, not the self-attention sub-layer, carries the bulk of a block's parameter budget at \(r = 4\).
  • Compare the encoder block (two sub-layers) with the decoder block (three sub-layers): which extra sub-layer does the decoder carry, and what does it consume?
  • Predict: at long sequence lengths \(T\), which sub-layer's compute begins to dominate, and which chapter takes up the remedy?

References

  • src_001 β€” Christopher M. Bishop and Hugh Bishop. Deep Learning: Foundations and Concepts. Springer, 2024. https://www.bishopbook.com/
  • src_002 β€” Tong Xiao and Jingbo Zhu. Foundations of Large Language Models. arXiv:2501.09223v2, 2025. https://arxiv.org/pdf/2501.09223
  • src_010 β€” Sebastian Raschka. Build a Large Language Model (From Scratch). Manning, 2024. https://github.com/rasbt/LLMs-from-scratch
  • src_012 β€” Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. In Advances in Neural Information Processing Systems (NIPS 2017), 2017. https://arxiv.org/pdf/1706.03762
  • src_029 β€” Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288, 2023. https://arxiv.org/pdf/2307.09288
  • src_049 β€” Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. On Layer Normalization in the Transformer Architecture. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), 2020. https://arxiv.org/pdf/2002.04745