Encoder, Decoder, and Encoder-Decoder¶
The original Transformer was an encoder-decoder. Within five years of its publication the field had split that single architecture into three families — encoder-only, decoder-only, and encoder-decoder — and one of those branches, the decoder-only line, came to dominate frontier-scale language modelling. This chapter walks through the trichotomy: it sets up the classical taxonomy, follows the encoder-only branch from BERT in 2018 to ModernBERT in 2024, situates T5 and the text-to-text framing on the encoder-decoder side, and closes with the question we hand to Chapter 8 — why decoder-only won.
The argument has two threads. The first is taxonomic: each architectural family pairs naturally with a different self-supervised objective, and the choice of objective shapes what the resulting model is good at. The second is historical: BERT was designed in 2018 with the components that were standard at the time — learnable absolute positional embeddings, post-LayerNorm, GeLU, WordPiece, full bidirectional attention. In the years that followed, every one of those components was either superseded or refined elsewhere in the architecture stack. ModernBERT is what an encoder-only Transformer looks like after eight years of accumulated component-level upgrades have been folded back in.
Three architectural paradigms¶
The taxonomy is now standard. Encoder-only models — BERT, RoBERTa, ModernBERT — produce contextual representations of an input sequence in a single bidirectional pass; every position attends to every other position, and the output is a sequence of vectors meant to feed downstream classifiers or retrievers [src_001, src_003]. Decoder-only models — the GPT line and its successors — produce a distribution over the next token at every position under a causal mask, so each position attends only to itself and earlier tokens; we cover this family in detail in Chapter 8 [src_002, src_010]. Encoder-decoder models — the original Vaswani Transformer, T5 — combine the two: an encoder produces contextual representations of the input, and a decoder generates an output sequence autoregressively while cross-attending into the encoder's output [src_001, src_002, src_003]. The BERT paper itself names the two halves with what became their canonical labels — a "Transformer encoder" for the bidirectional version and a "Transformer decoder" for the left-context-only version [src_014].
🔗 Connection
The bidirectional / causal contrast pivots on the attention mask machinery introduced in Chapter 1. Cross-attention is the encoder-decoder coupling whose mechanics we cover in §"Encoder-decoder: the T5 path" below.
🎯 Intuition
The trichotomy is easiest to read off what each family can see at every position:
- Encoder-only: every position sees every other position (full bidirectional context).
- Decoder-only: each position sees only itself and earlier positions (lower-triangular causal mask).
- Encoder-decoder: an encoder pass reads the input bidirectionally, then the decoder generates left-to-right while looking back at the encoder's output through cross-attention.
Visibility-at-each-position is what determines which loss the architecture's attention pattern actually makes tractable.
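To make the visibility patterns concrete in code, here is a minimal PyTorch sketch — illustrative, not drawn from any cited implementation — of the two self-attention masks; the encoder-decoder case runs the first inside the encoder and the second inside the decoder, plus cross-attention into the encoder's output:

```python
import torch

T = 6  # toy sequence length

# Encoder-only: full visibility — every position attends to every other.
full_mask = torch.ones(T, T, dtype=torch.bool)

# Decoder-only: lower-triangular causal mask — position i sees only j <= i.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))

# Disallowed positions get -inf before the softmax, so they receive zero weight.
scores = torch.randn(T, T)  # stand-in for Q K^T / sqrt(d_h)
attn = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

print(causal_mask.int())  # row i has ones only in columns 0..i
```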
Each paradigm aligns with a different self-supervised objective. Encoder-only models pair with denoising objectives — predict the masked tokens given their bidirectional context. Decoder-only models pair with left-to-right next-token prediction, which is exactly what the causal mask makes tractable. Encoder-decoder models pair with sequence-to-sequence denoising — corrupt the input, generate the original from the encoder's representation. The architecture and the objective are not independent choices; one implies the other.
For the rest of this chapter we use \(T\) for sequence length, \(B\) for batch size, \(D\) for hidden dimension, \(h\) for the number of attention heads, \(d_h = D / h\) for per-head dimension, \(V\) for vocabulary size, and \(L\) for number of Transformer layers. These are the same symbols used in Chapter 1.
💡 Key result
Each architecture pairs with the objective whose loss its attention pattern actually makes tractable.
BERT: the bidirectional encoder¶
BERT is a stack of \(L\) Transformer encoder layers — bidirectional self-attention plus a position-wise FFN — trained to fill in masked tokens. The architecture is conventional; the contribution that mattered was the pretraining objective, which made bidirectional context tractable for self-supervised learning at scale [src_014].
🔗 Connection
The "bidirectional self-attention" the BERT block runs on is the full attention pattern (no mask) introduced in Chapter 1; the contrast with the lower-triangular causal mask used by the GPT line below pivots on the same machinery.
The input representation is the sum of three learned embeddings per token. For a token at position \(i\) in segment \(s \in \{A, B\}\), the input embedding is
\[
\mathbf{e}_i = \mathbf{w}_{x_i} + \mathbf{s}_s + \mathbf{p}_i,
\]
where \(\mathbf{w}_{x_i}\) is the WordPiece token embedding (WordPiece is a subword tokeniser — a predecessor of BPE — that splits rare words into smaller pieces drawn from a learned vocabulary), \(\mathbf{s}_s\) is the segment embedding (one of two learned vectors marking which sentence the token belongs to in a pair), and \(\mathbf{p}_i\) is the learnable absolute positional embedding — a separate trainable vector for each position up to a fixed maximum length [src_014]. We will return to the positional term shortly because that single design choice is the most visible thing the field replaced when it moved on. BERT's vocabulary is WordPiece with 30,000 tokens, and the maximum sequence length is 512 [src_014].
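A minimal PyTorch sketch of the three-way sum, under BERT-base sizes from the paper (\(V = 30{,}000\), \(D = 768\), 512 positions); the function and variable names are ours, and the LayerNorm and dropout BERT applies on top of the sum are noted but omitted:

```python
import torch
import torch.nn as nn

V, D, MAX_LEN = 30_000, 768, 512  # BERT-base sizes [src_014]

tok_emb = nn.Embedding(V, D)        # w_{x_i}: WordPiece token embedding
seg_emb = nn.Embedding(2, D)        # s_s: segment A vs. segment B
pos_emb = nn.Embedding(MAX_LEN, D)  # p_i: learnable absolute position

def bert_input(token_ids, segment_ids):
    """e_i = w_{x_i} + s_s + p_i, summed elementwise at every position.
    (BERT also applies LayerNorm and dropout to this sum; omitted here.)"""
    B, T = token_ids.shape
    positions = torch.arange(T).expand(B, T)
    return tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)

x = bert_input(torch.randint(0, V, (2, 16)),
               torch.zeros(2, 16, dtype=torch.long))
print(x.shape)  # torch.Size([2, 16, 768])
```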
Two reserved tokens give the format its shape. The first token of every
sequence is [CLS], whose final hidden state is taken as the aggregate
representation of the input for classification tasks; sentence pairs
are joined with a [SEP] token and distinguished by the segment
embedding [src_014].
The first pretraining objective is Masked Language Modeling (MLM). At
training time, 15% of WordPiece tokens are selected for masking. Of
those selected tokens, 80% are replaced with the special [MASK]
token, 10% are replaced with a random token from the vocabulary, and
the remaining 10% are left unchanged; the model is trained to predict
the original token at every selected position [src_014]. Writing \(\mathcal{M}\)
for the set of masked positions and \(x_{-\mathcal{M}}\) for the input with
those positions replaced according to the 80/10/10 rule, the MLM loss is
\[
\mathcal{L}_{\mathrm{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P\!\left(x_i \mid x_{-\mathcal{M}}\right),
\]
🤔 Pause and reflect
The 80/10/10 split is unusual. Before reading on, can you say what
would go wrong if BERT just replaced every selected position with
[MASK]?
(Do not look ahead — write the answer down or say it out loud.)
🎯 Intuition
The model sees the sentence with holes punched in it, and is graded
on how well it fills each hole. The 80/10/10 split is what determines
what those holes look like: sometimes a literal [MASK], sometimes
a random wrong word, sometimes the original token left in place.
Average the per-hole log-loss.
where the model produces \(P(x_i \mid x_{-\mathcal{M}})\) at each masked position by passing the final-layer hidden state through the transposed token-embedding matrix and a softmax over the vocabulary [src_014]. The 80/10/10 split is not just a detail. Replacing every masked position with [MASK] would create a train/test mismatch, since [MASK] never appears at fine-tuning or inference time; the random and unchanged fractions force the model to keep producing useful representations for tokens that look unmasked.
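The selection-and-replacement rule is mechanical enough to sketch directly. The following is a simplified illustration, not BERT's actual preprocessing code: the [MASK] id is a placeholder, and the exemption of special tokens such as [CLS] and [SEP] from selection is skipped.

```python
import torch

V, MASK_ID = 30_000, 103  # vocab size; the [MASK] id here is illustrative

def mlm_corrupt(token_ids, mask_prob=0.15):
    """Apply BERT's 15% selection and 80/10/10 replacement rule."""
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < mask_prob
    labels[~selected] = -100  # F.cross_entropy's default ignore_index

    corrupted = token_ids.clone()
    r = torch.rand(token_ids.shape)
    corrupted[selected & (r < 0.8)] = MASK_ID          # 80% -> [MASK]
    swap = selected & (r >= 0.8) & (r < 0.9)           # 10% -> random token
    corrupted[swap] = torch.randint(0, V, token_ids.shape)[swap]
    return corrupted, labels  # remaining 10% of selected left unchanged

corrupted, labels = mlm_corrupt(torch.randint(0, V, (2, 16)))
# loss = F.cross_entropy(logits.view(-1, V), labels.view(-1))  # selected positions only
```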
🔗 Connection
MLM has a clean cross-modal analogue: the masked-autoencoder (MAE) family for vision dissected in Chapter 6 replaces masked tokens with masked image patches and reconstructs the held-out content from the visible context — the same denoising idea transposed from text to images.
The second pretraining objective is Next Sentence Prediction (NSP). For
50% of training pairs, sentence B genuinely follows sentence A in the
source corpus; for the other 50%, sentence B is sampled at random from
elsewhere. A binary classifier on the final hidden state of [CLS]
predicts whether B follows A [src_014]. The original BERT paper reports
that NSP helps on Question Answering and Natural Language Inference and
includes it in the recommended pretraining recipe [src_014].
🔄 Recap
- Complete the equation. What three learnable embeddings are summed at each position to form BERT's input embedding \(\mathbf{e}_i\), and what does each one represent?
- Explain why. Why does the 80/10/10 mask-replacement rule replace some selected tokens with random words and leave others unchanged, instead of replacing every selected position with [MASK]?
- Predict. Suppose a successor model dropped the NSP objective entirely. Which of BERT's design choices would you expect to become wasted machinery, and which would still pull their weight?
The NSP folklore¶
A year later, RoBERTa took BERT's recipe apart hyperparameter by
hyperparameter and asked which design choices actually paid for
themselves. The headline finding for our purposes: NSP did not. In the
RoBERTa ablation, configurations that drop NSP and feed the model
contiguous sentences sampled either from a single document
(DOC-SENTENCES) or across document boundaries (FULL-SENTENCES)
match or slightly improve downstream task scores compared to the
NSP-retained baseline [src_015]. The clinical reading is that BERT's positive NSP result was an artifact of the input format used in the NSP ablation, not of the loss term itself.
RoBERTa goes further. The same encoder architecture, trained longer, on more data, with bigger batches and dynamic masking (re-sampling the masked positions on every epoch, in contrast with BERT's preprocessing-time pre-computed masks) — and without NSP — matches or exceeds the post-BERT models that had been published in the intervening year [src_015]. The takeaway is dual. Architecturally, NSP was a load-bearing claim that did not hold up. Methodologically, the original BERT was significantly undertrained, and a substantial fraction of the gap between BERT-2018 and "BERT successors" was attributable to compute and data rather than to architectural innovation. Both lessons recur across the rest of the book.
🔗 Connection
"BERT was significantly undertrained" is exactly the empirical content of the compute-optimal allocation argument that Chapter 9 develops as the Chinchilla scaling laws.
⚠️ Pitfall
BERT's positive NSP result was confounded by the input-format choice in the NSP ablation (segment-pair vs. document-contiguous), not by the loss term itself. Reading the original paper as "NSP helps" without reading RoBERTa's ablation reverses the lesson.
The learnable-absolute folklore¶
The same pattern shows up in BERT's positional encoding, on a longer fuse. Learnable absolute positional embeddings — a separate trainable vector for each position up to a fixed maximum length — were a reasonable default in 2018 and were inherited from earlier sequence models. They have two practical limits that became binding as context windows grew. First, they are tied to a fixed maximum length: a model trained with 512 positional vectors has nothing meaningful to put at position 513. Second, they do not extrapolate well: the embeddings at positions 500 and 501 are independent learned vectors, with no built-in inductive bias that they should encode similar relative information [src_002, src_016].
The replacement, covered in Chapter 2, is RoPE — rotary position embedding — which encodes position as a rotation applied to query and key vectors in pairs of dimensions. RoPE has the desirable property that the dot product between rotated queries and keys depends only on their relative offset, which gives a built-in relative-position bias without an extra additive term [src_002, src_016]. ALiBi, a competing scheme that adds a fixed, distance-proportional penalty to attention scores (with a different, non-learned slope per head), occupies the same niche. Both schemes generalize to lengths beyond what was seen at training time, with appropriate choices of base frequency or extension scheme [src_002].
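The relative-offset property is easy to verify numerically. Below is a toy, single-vector RoPE sketch — a didactic illustration, not a production implementation — showing that the dot product of a rotated query and key is unchanged when both positions shift by the same amount:

```python
import torch

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)  # one frequency per pair
    theta = pos * inv_freq
    cos, sin = torch.cos(theta), torch.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q, k = torch.randn(64), torch.randn(64)
# Same relative offset (3), very different absolute positions:
near = torch.dot(rope(q, 10.0), rope(k, 13.0))
far = torch.dot(rope(q, 1000.0), rope(k, 1003.0))
print(torch.allclose(near, far, atol=1e-3))  # True — score depends only on the offset
```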
🔗 Connection
Chapter 2 covers RoPE in detail — including why the base frequency sets the wavelength of the slowest rotation, why ALiBi sits in the same niche, and how the YaRN / NTK-aware "extension schemes" let the same encoding generalise past its training-time context length.
ModernBERT: the BERT that aged well¶
ModernBERT, released in late 2024, is the encoder-only architecture rebuilt with the components that became standard in the intervening years [src_016]. The headline claim, made explicit in the ModernBERT paper, is not that the BERT design became bad. It is that the field's defaults moved, and an encoder-only model that adopts those defaults is faster, longer-context, and stronger on downstream evaluations than the original.
The decision matrix is the cleanest way to read the design space.
| Dimension | BERT (2018) | ModernBERT (2024) |
|---|---|---|
| Positional encoding | Learnable absolute (added to input embeddings) | RoPE applied to Q/K, with separate base frequencies for global and local layers |
| Normalization | Post-LayerNorm | Pre-LayerNorm; bias terms removed; extra LayerNorm after the embedding layer |
| FFN activation | GeLU | GeGLU (a Gated-Linear-Unit variant of GeLU) |
| Pretraining objective | MLM (15% mask) + NSP | MLM only, with a higher 30% mask rate |
| Attention pattern | Full bidirectional, every layer | Alternating: local sliding-window attention (128 tokens) most layers; global attention every third layer |
| Maximum sequence length | 512 tokens | 8192 tokens natively (extended from 1024 during training) |
| Tokenizer | WordPiece, \(V = 30{,}000\) | Modified OLMo BPE, \(V = 50{,}368\) (multiple of 64 for GPU tiling) |
| Bias terms | Present in linear and LayerNorm | Removed from linear layers (except final decoder linear) and from LayerNorms |
| Pretraining data | Roughly 3.3B words (BooksCorpus + English Wikipedia) | 2T tokens of mixed-domain text including code |
| Attention implementation | Pre-FlashAttention era | FlashAttention-3 (global), FlashAttention-2 (local), with unpadding |
References for the BERT column: [src_014]. References for the ModernBERT column: [src_016]. Unpadding in the last row is the FlashAttention-era trick of skipping computation on padding tokens by reshaping a padded batch into a single concatenated sequence, with the original shape recovered at output time.
Per-component commentary¶
A few of these deserve brief comment because they encode broader architectural shifts in the field.
The shift from post-LayerNorm to pre-LayerNorm is a stability story. With post-LN, the LayerNorm is applied after the residual add, and at depth this configuration is famously fragile during training. Pre-LN applies the normalization before each sub-layer's nonlinear transformation, leaving the residual stream untouched, which is more stable for deep stacks [src_016]. We treat this in detail in Chapter 3.
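In code the difference is one line. A schematic sketch, under our own naming rather than either paper's implementation:

```python
import torch
import torch.nn as nn

def post_ln_block(x, sublayer, norm):
    # BERT (2018): LayerNorm sits after the residual add, so the residual
    # stream itself is renormalized at every layer.
    return norm(x + sublayer(x))

def pre_ln_block(x, sublayer, norm):
    # ModernBERT (2024): LayerNorm is applied only to the sub-layer's input;
    # the residual stream is an untouched identity path through the depth.
    return x + sublayer(norm(x))

x = torch.randn(2, 16, 768)
ffn, norm = nn.Linear(768, 768), nn.LayerNorm(768)
y = pre_ln_block(x, ffn, norm)
```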
🔗 Connection
Chapter 3 treats the post-LN → pre-LN shift, the GeLU → GeGLU upgrade, and the gating-vs-non-gating contrast (with the GLU formula \(\mathrm{GLU}(x) = (xW_1 + b_1) \odot \sigma(xW_2 + b_2)\)) head-on; the brief comments here only signal which dimensions Ch.3 covers.
The shift from GeLU to GeGLU is the same gating-vs-non-gating shift that appears on the decoder side as SwiGLU; the FFN gains a learned gate that multiplies the activation, at the cost of a third projection matrix (usually offset by shrinking the FFN intermediate dimension, commonly to about two-thirds of the non-gated width) [src_016]. The ModernBERT paper picks GeGLU specifically; SwiGLU is the typical choice for the modern decoder-only family that Chapter 8 dissects.
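A sketch of the gated FFN under our own naming — the \(d_{\mathrm{ff}}\) value is illustrative, not ModernBERT's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GegluFFN(nn.Module):
    """Gated FFN: GeLU(x W_gate) elementwise-times (x W_value), then project
    back down. Bias-free linears, matching ModernBERT's removal of bias terms."""
    def __init__(self, d_model=768, d_ff=2048):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_value = nn.Linear(d_model, d_ff, bias=False)
        self.w_out = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # Three weight matrices instead of the non-gated FFN's two; the usual
        # compensation is to shrink d_ff to hold parameter count level.
        return self.w_out(F.gelu(self.w_gate(x)) * self.w_value(x))

y = GegluFFN()(torch.randn(2, 16, 768))
```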
The alternating local-global attention pattern is the substantive architectural delta. Most ModernBERT layers use local sliding-window attention with a 128-token window; every third layer is a global-attention layer that mixes information across the full sequence. This keeps attention compute closer to linear in sequence length while preserving the long-range mixing that makes 8192-token contexts useful [src_016]. The two attention modes use different RoPE base frequencies because their effective relative-position ranges differ.
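A sketch of the alternating pattern as boolean visibility masks — the indexing convention for the 128-token window and the choice of which layers count as global are our assumptions, not the paper's exact schedule:

```python
import torch

def sliding_window_mask(T, window=128):
    # Bidirectional local attention: i sees j iff |i - j| <= window // 2.
    # (One common convention for a "128-token window"; implementations vary.)
    idx = torch.arange(T)
    return (idx[None, :] - idx[:, None]).abs() <= window // 2

def layer_mask(layer_idx, T, global_every=3, window=128):
    # Every third layer attends globally; the rest stay local. Which layers
    # are the global ones (the offset) is an assumption here.
    if layer_idx % global_every == 0:
        return torch.ones(T, T, dtype=torch.bool)
    return sliding_window_mask(T, window)

masks = [layer_mask(l, T=512) for l in range(22)]  # 22 layers in ModernBERT-base
```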
The mask rate change — 15% in BERT, 30% in ModernBERT — is small in the table but worth noticing. The higher mask rate gives more supervision per training sequence, and the ModernBERT paper finds it preferable empirically; the original 15% rate was itself a hyperparameter choice rather than a derived optimum [src_016, src_014]. NSP is dropped, as foreshadowed by RoBERTa.
The framing matters. ModernBERT did not invent any of these components; it consolidated improvements that had been validated elsewhere — in RoBERTa, in the decoder-only line, in FlashAttention work — and showed that they compose into a Pareto improvement over the 2018 encoder design (better on every axis simultaneously — speed, context length, and downstream accuracy — with no trade-off) [src_016].
💡 Key result
ModernBERT did not invent any single component; it composed eight years of decoder-side and FlashAttention-era component upgrades into the encoder-only stack and showed they Pareto-improve the 2018 design.
🔄 Recap
- Compare. On positional encoding and on attention pattern, what changed between BERT-2018 and ModernBERT-2024, and what stayed the same?
- Explain why. Why is the post-LayerNorm → pre-LayerNorm shift a stability story, and why does it matter more at depth?
- Predict. Suppose ModernBERT replaced its alternating local-global pattern with all-local sliding-window attention everywhere. Which of its headline benefits would you expect to lose first?
Encoder-decoder: the T5 path¶
The third branch is encoder-decoder, the architecture of the original Transformer. T5 — Text-to-Text Transfer Transformer — is its 2020 canonical instantiation [src_002, src_003]. Two ideas matter.
The first is a unified text-to-text framing. T5 treats every NLP task — classification, question answering, translation, summarization — as a text-to-text problem. The input is a textual prompt that includes a task prefix; the output is a textual answer; the same encoder-decoder model handles all of them [src_003]. This framing turns out to anticipate the prompt-engineered, single-model paradigm that decoder-only models later took to a logical conclusion at scale, but T5 implements it with an encoder-decoder backbone.
The second is the span corruption pretraining objective. Rather than masking individual tokens (BERT-style) or predicting the next token (GPT-style), T5 masks contiguous spans of input — average length around three tokens — replacing each masked span with a single sentinel token. The decoder is trained to generate a sequence that lists each sentinel followed by the original tokens of the corresponding span [src_003, src_002]. The encoder sees a corrupted input; the decoder reconstructs what was removed. This objective sits between the BERT and GPT extremes: it is denoising like BERT, but it asks for autoregressive generation like GPT.
🤔 Pause and reflect
Take the input the cat sat on the mat. Suppose the spans selected for masking are sat (length 1) and the mat (length 2). Before reading the worked example below, can you write out what the encoder input and the decoder target should look like, including where the sentinel tokens go? (Do not look ahead — write it down first.)
🎯 Intuition
A worked example makes the transformation concrete. Take the input the cat sat on the mat. Suppose two spans are selected: sat (one token) and the mat (two tokens). Each masked span is replaced by a fresh sentinel:
- Encoder input: the cat <X> on <Y>
- Decoder target: <X> sat <Y> the mat <Z>
The decoder generates the sentinel-prefixed reconstruction
autoregressively, and the trailing <Z> marks end-of-sequence.
Sentinels are what let one variable-length output describe many
variable-length holes.
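The transformation is small enough to express directly. A toy sketch — note that real T5 names its sentinels <extra_id_0>, <extra_id_1>, … rather than the chapter's <X>, <Y>, <Z> notation:

```python
def span_corrupt(tokens, spans):
    """Build T5-style encoder input and decoder target.
    `spans` is a list of (start, length) index pairs into `tokens`,
    assumed sorted and non-overlapping."""
    sentinels = ["<X>", "<Y>", "<Z>", "<W>"]  # one fresh sentinel per span
    enc, dec, cursor = [], [], 0
    for sid, (start, length) in enumerate(spans):
        enc += tokens[cursor:start] + [sentinels[sid]]   # replace span with sentinel
        dec += [sentinels[sid]] + tokens[start:start + length]  # sentinel + span content
        cursor = start + length
    enc += tokens[cursor:]
    dec += [sentinels[len(spans)]]  # trailing sentinel marks end-of-sequence
    return enc, dec

tokens = "the cat sat on the mat".split()
enc, dec = span_corrupt(tokens, [(2, 1), (4, 2)])  # mask "sat" and "the mat"
print(enc)  # ['the', 'cat', '<X>', 'on', '<Y>']
print(dec)  # ['<X>', 'sat', '<Y>', 'the', 'mat', '<Z>']
```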
The honest assessment in 2026 is that T5 was excellent in 2020 and the field still migrated away from it. By 2022, decoder-only models with in-context learning had effectively subsumed most of T5's use cases — translation, summarization, classification — without needing the encoder-decoder pretraining apparatus. The inference path is also simpler in decoder-only: a single causal-attention KV-cache, no encoder pass to manage separately. We do not relitigate this here; the next section sketches the why in three lines, and Chapter 8 fills in the architectural details.
🔗 Connection
Cross-attention — the encoder-decoder coupling first invoked in §"Three architectural paradigms" — is the mechanism the span-corruption objective above relies on; the KV-cache referenced in the closing sentence is dissected in Chapter 4, where it is the central memory object of autoregressive decoding.
Why decoder-only won¶
Three reasons, all of which emerged through the early 2020s, explain the migration to decoder-only as the default for general-purpose language modelling. The empirical case for the architectural choice was made in a controlled head-to-head comparison by Wang et al. (2022), who crossed three architectures — causal decoder-only; non-causal decoder-only, i.e. a prefix-LM with bidirectional attention over a prefix and a causal generation tail, covered as a variant in Chapter 8; and encoder-decoder — with two pretraining objectives, autoregressive language modelling and masked language modelling / span corruption. After pure unsupervised pretraining, the causal decoder-only + autoregressive-LM combination produced the strongest zero-shot generalisation, which is the empirical anchor beneath the GPT-style architectural bet [src_051].
First, zero-shot prompting and in-context learning emerged as properties of decoder-only models scaled past a critical threshold. The GPT-3 demonstration showed that a sufficiently large decoder-only model trained purely with next-token prediction could perform a wide range of tasks at inference time conditioned on a few examples in the prompt, without parameter updates [src_037]. We treat this in Chapter 8 because it is the qualitative behaviour that makes the modern decoder- only stack worth assembling.
🔗 Connection
Chapter 8 dissects the modern decoder-only stack — including how zero-shot prompting and in-context learning emerge with scale, and the per-component successors (RMSNorm, SwiGLU, GQA, KV-cache) that the closing §"Looking ahead" foreshadows.
Second, one-stage pretraining. Decoder-only models are trained with a single causal language modelling objective on a single autoregressive forward pass; encoder-decoder pretraining requires choreographing two networks, an explicit corruption-and-reconstruction objective, and a cross-attention pathway. At constant compute the simpler recipe scales more cleanly.
Third, operational simplicity at inference time. A decoder-only model serves a prompt by streaming tokens out of a single causal stack with a single KV-cache; no encoder pass to manage, no separate decoder state beyond the KV-cache itself. We unpack the inference pipeline in Chapter 8.
💡 Key result
The Wang et al. controlled comparison, plus zero-shot/ICL emergence, plus operational simplicity at inference, together explain why the decoder-only causal-LM combination became the frontier default — but only for general-purpose generation.
The framing should not be triumphalist. Decoder-only is the right default at frontier scale for general-purpose generation; it is not a universal verdict. Encoder-only models remain genuinely competitive — often dominant — for retrieval, classification, and embedding tasks where bidirectional context is the point and generation is not needed, and where latency and serving cost matter [src_016]. ModernBERT was designed precisely for those tasks, and it competes against encoder-only baselines, not against frontier decoder-only language models. Encoder-decoder models retain a niche in translation and structured generation tasks where the input and output are clearly separated and encoder bidirectionality is helpful.
🔄 Recap
- Complete. Of the three reasons given for decoder-only's rise (zero-shot/ICL, one-stage pretraining, inference simplicity), which one does the Wang et al. controlled comparison actually anchor empirically?
- Compare. Name one task where encoder-only is still the right default and one task where encoder-decoder retains a niche, with a one-line reason for each.
- Predict. Suppose a 2024-era controlled comparison were re-run with current-recipe pretraining (much larger compute, post-2022 data). Which of the three reasons would you expect to gain force and which would you expect to weaken?
Looking ahead¶
Chapter 8 takes the decoder-only branch and dissects what a 2024-2025 frontier model — Llama-3, DeepSeek-V3, the Qwen and Gemma families — looks like inside. The components introduced in passing here (RoPE in §"The learnable-absolute folklore", GeGLU and pre-LayerNorm in the ModernBERT table, alternating attention patterns) reappear there in their decoder-side incarnations: RoPE on Q and K, RMSNorm rather than LayerNorm, SwiGLU rather than GeGLU, GQA for the attention variant, and an inference pipeline that foregrounds the KV-cache as the central memory object. The same historical motion that produced ModernBERT — keep the original block diagram, replace each component with its eight-years-on successor — is the motion that produced the modern decoder-only LLM. The next chapter shows what arrives when you do that on the GPT line.
References¶
- src_001 — Bishop, C. M. & Bishop, H. Deep Learning: Foundations and Concepts. Springer, 2024. https://www.bishopbook.com/
- src_002 — Xiao, T. & Zhu, J. Foundations of Large Language Models. arXiv:2501.09223v2, 2025. https://arxiv.org/pdf/2501.09223
- src_003 — Jurafsky, D. & Martin, J. H. Speech and Language Processing, 3rd ed. (rolling, January 2026 release). https://web.stanford.edu/~jurafsky/slp3/
- src_010 — Raschka, S. Build a Large Language Model (From Scratch). Manning, 2024. https://github.com/rasbt/LLMs-from-scratch
- src_014 — Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805, 2018. https://arxiv.org/pdf/1810.04805
- src_015 — Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. & Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692, 2019. https://arxiv.org/pdf/1907.11692
- src_016 — Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., Gallagher, A., Biswas, R., Ladhak, F., Aarsen, T., Cooper, N., Adams, G., Howard, J. & Poli, I. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference (ModernBERT). arXiv:2412.13663, 2024. https://arxiv.org/pdf/2412.13663
- src_037 — Brown, T. et al. Language Models are Few-Shot Learners. arXiv:2005.14165, 2020. (Cited via Chapter 8; introduced here for the in-context-learning argument.)
- src_051 — Wang, T., Roberts, A., Hesslow, D., Le Scao, T., Chung, H. W., Beltagy, I., Launay, J. & Raffel, C. What Language Model Architecture and Pretraining Objective Works Best for Zero-Shot Generalization? Proc. ICML 2022. arXiv:2204.05832. https://arxiv.org/pdf/2204.05832