
Vision Transformers

1. From sequence-to-sequence to image-to-sequence

Chapter 1 framed the Transformer as a sequence operator. Given \(T\) tokens, each represented by a \(D\)-dimensional vector arranged into an activation tensor of shape \((B, T, D)\), the encoder stack returns a sequence of the same shape, with information mixed across positions by self-attention and across feature channels by the position-wise feed-forward network. Nothing in that architecture is specific to language; the only commitments are that the input arrives as a one-dimensional sequence of equal-dimensional vectors and that some position information is supplied to break the layer's permutation equivariance.

🔗 Connection

Chapter 1 develops the Transformer block as a sequence operator; this chapter takes that block as a given and builds the patch-embedding front end that feeds it.

The Vision Transformer of Dosovitskiy et al. is the minimal cash-out of that observation for images [src_013]. Take a standard NLP-style Transformer encoder, change nothing inside the encoder block, and prepend a small front end whose entire job is to turn an image into a sequence of vectors that the encoder can consume. There is no convolutional stem, no pyramid of feature maps, no spatial pooling between blocks; the encoder is the encoder of Chapter 1, and the inductive biases that classical convolutional networks bake into every layer (locality, translation equivariance, hierarchical feature aggregation) are deliberately not built in. Convolutions reuse the same small spatial filter at every position (locality and translation equivariance), and stacked strided convolutions or pooling layers compress fine-grained features into coarser ones (hierarchical aggregation); a self-attention layer does none of these by construction. The ViT paper's central empirical claim is that this minimal port works, given enough data: pretrained at sufficient scale and transferred to image recognition benchmarks, ViT matches or exceeds the strongest residual-network baselines [src_013].

That this is the right framing of ViT — a port of an NLP encoder to vision rather than a new vision architecture — is also how the canonical 2024 textbooks position it. Bishop and Bishop discuss vision Transformers as an instance of the same Transformer machinery covered in their language chapters, and Torralba, Isola and Freeman's Foundations of Computer Vision places ViT alongside the classical convolutional stack as an alternative backbone with a different bias profile [src_001, src_008]. Courant et al.'s 2023 chapter on visual Transformers takes the same view, treating attention basics, ViT, and small-data extensions as one continuous lineage rather than separate developments [src_009].

This chapter follows that framing. Section 2 builds the patch-embedding front end and shows the equivalence with a single Conv2d. Section 3 describes the [CLS] token and the position embedding. Section 4 works the token-count arithmetic explicitly. Section 5 collapses the encoder to a single sentence, since Chapter 1 already covered the substance. Sections 6 through 8 deal with the data-and-compute story that originally distinguished ViT from convolutional baselines, the DeiT shortcut that softened the data requirement, and Swin Transformer V2 as the most prominent re-introduction of convolutional priors. Section 9 zooms out to ViT in 2026, and Section 10 closes by pointing forward to self-supervised vision in Chapter 6.

2. Patch embedding

Let an image have shape \(H \times W \times C\), with \(C = 3\) for standard RGB inputs. Choose a patch size \(P\) that divides both \(H\) and \(W\), and reshape the image into a sequence of \(N = (H/P) \cdot (W/P) = HW/P^2\) flattened patches, each of shape \(P \times P \times C\) flattened into a vector of length \(P^2 \cdot C\). A single trainable linear projection \(E \in \mathbb{R}^{(P^2 \cdot C) \times D}\) maps each patch to a \(D\)-dimensional token. Equation (1) of the ViT paper captures exactly this construction [src_013]:

🎯 Intuition

Each \(16 \times 16 \times 3\) image patch is flattened to a \(768\)-element vector and passed through one shared linear map \(E\) to land in a \(768\)-dimensional token space. The equation that follows says only that we stack these patch tokens into a sequence, prepend a learnable [CLS] vector, and add one learnable position vector per slot. The convolution-equivalence and the code block below operationalise this picture; the equation does not yet do any work the prose has not already done.

\[z_0 = [x_{\text{class}}; \, x_p^1 E; \, x_p^2 E; \, \ldots; \, x_p^N E] + E_{\text{pos}}, \qquad E \in \mathbb{R}^{(P^2 \cdot C) \times D}, \quad E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}.\]

The \(x_{\text{class}}\) term and the \(E_{\text{pos}}\) summand are the [CLS] token and the position embedding, deferred to Section 3. The substantive content here is the construction of the \(N\) patch tokens.

The key implementation observation is that the flatten-then-linear-project of \(P \times P \times C\) patches is mathematically equivalent to a single 2D convolution with kernel size \(P\) and stride \(P\), mapping \(C\) input channels to \(D\) output channels [src_001, src_008, src_013]. With kernel size equal to stride, the convolution sees each patch exactly once, applies the same shared linear map to all patches, and emits a \(D\)-channel feature map of spatial shape \((H/P) \times (W/P)\). Flattening the spatial dimensions and transposing yields the desired \(N \times D\) token sequence. The two formulations differ only in how they are written down; they realise the same parameter set and the same forward map.

This equivalence is convenient because Conv2d kernels are extremely well optimised on every modern GPU and TPU stack, while a manual unfold-then-linear pipeline is not. The canonical PyTorch implementation collapses to two lines:

import torch
import torch.nn as nn

# Patch embedding for ViT-B/16: P = 16, C = 3, D = 768
proj = nn.Conv2d(in_channels=3, out_channels=768, kernel_size=16, stride=16)

# image: (B, 3, 224, 224); a random batch stands in for real data here
image = torch.randn(8, 3, 224, 224)

# proj(image):              (B, 768, 14, 14)
# .flatten(2):              (B, 768, 196)
# .transpose(1, 2):         (B, 196, 768)
tokens = proj(image).flatten(2).transpose(1, 2)

Reading this from the outside in: the convolution turns the image into a \((B, D, H/P, W/P)\) feature map; flatten(2) collapses the spatial axes into a single axis of length \(N = HW/P^2\); transpose(1, 2) swaps the channel and sequence axes so the result has the standard \((B, N, D)\) layout. The same effect can be achieved with einops.rearrange(image, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=P, p2=P) followed by an \(\texttt{nn.Linear}(P^2 C, D)\) — this is the explicit linear-projection form, and it is closer to ViT eq. (1) read literally — but the convolution form is the one that actually ships in production codebases [src_013].
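The equivalence is easy to verify numerically. The sketch below (variable names and the unfold-based patch extraction are illustrative, not taken from any particular codebase) copies the convolution's parameters into an nn.Linear and checks that the two paths produce identical tokens:

import torch
import torch.nn as nn

P, C, D = 16, 3, 768
conv = nn.Conv2d(C, D, kernel_size=P, stride=P)

# Build an nn.Linear that shares the convolution's parameters.
# conv.weight has shape (D, C, P, P); flattening it in (c, i, j) order gives the
# matrix that acts on a patch flattened in the same order.
linear = nn.Linear(P * P * C, D)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(D, -1))
    linear.bias.copy_(conv.bias)

image = torch.randn(2, C, 224, 224)

# Path 1: strided convolution, then flatten the spatial grid into a sequence.
tokens_conv = conv(image).flatten(2).transpose(1, 2)              # (B, N, D)

# Path 2: cut out non-overlapping patches, flatten each, apply the linear map.
patches = image.unfold(2, P, P).unfold(3, P, P)                   # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, -1, C * P * P)
tokens_linear = linear(patches)                                    # (B, N, D)

print(torch.allclose(tokens_conv, tokens_linear, atol=1e-5))       # True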

Figure fig_003 visualises the pipeline: a Conv2d with kernel = stride = \(P\), followed by flatten and transpose, is equivalent to flatten-then-linear-project of \(P \times P\) patches; the [CLS] token is prepended and a learnable position embedding is added per token.

The convolution-equivalence framing is also the natural pedagogical bridge from classical computer vision to ViT. Filter banks applied over a sliding spatial grid are exactly what convolutions implement; ViT's patch embedding is the special case in which the stride matches the kernel and there is no overlap [src_001, src_008]. From this perspective the only architecturally novel thing about the ViT front end is that the spatial grid is then collapsed into a one-dimensional token sequence and all subsequent processing is permutation-equivariant up to position embeddings.

3. CLS token and position embedding

Two pieces of bookkeeping must be added to the patch sequence before it can enter the encoder.

The first is the classification token, written [class] or [CLS]. Following BERT's NLP convention, ViT prepends a single learnable embedding to the patch sequence, so that the input has \(N + 1\) tokens whose positions run from \(0\) (the [CLS] token) to \(N\) (the last patch) [src_013].

🔗 Connection

The [CLS]-token convention is borrowed from BERT (Chapter 7), where a single register-style token is prepended to the input sequence and its final hidden state is read as a sentence-level representation. The hidden state at position \(0\) at the output of the encoder serves as the image representation, and a small classification head — a one-hidden-layer MLP at pretraining time, a single linear layer at fine-tuning time — is attached to that hidden state to produce class logits [src_013]. The [CLS] token has no spatial meaning; it is a register that the self-attention layers learn to populate with whatever globally-pooled information the classification head needs.

The second is the position embedding \(E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}\), added pointwise to the \(N+1\) token sequence as in eq. (1) above [src_013]. ViT uses standard learnable 1-D absolute position embeddings: each of the \(N+1\) positions is associated with a freely learnable \(D\)-dimensional vector, initialised randomly and trained jointly with the rest of the model. The choice of 1-D learnable rather than 2-D-aware position embeddings (for example sin-cos pairs along \(H\) and \(W\), or factorised row and column embeddings) is one of the small surprises of the original paper: Dosovitskiy et al.'s ablation in their Appendix D.4 reports that 2-D-aware variants give negligible gain over the 1-D baseline in their setting, so the simpler scheme was kept [src_013]. The paper's interpretation, supported by their position-similarity visualisation, is that the model learns 2-D image topology from the patches themselves: trained position embeddings end up with row-column structure visible in their cosine-similarity matrix, and closer patches develop more similar embeddings [src_013].
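Operationally, eq. (1) is just a concatenation and an addition. A minimal sketch with ViT-B/16 shapes (the zero initialisation is for brevity only; real implementations draw these parameters from a truncated normal):

import torch
import torch.nn as nn

B, N, D = 2, 196, 768
patch_tokens = torch.randn(B, N, D)                  # output of the Conv2d patch embedding

cls_token = nn.Parameter(torch.zeros(1, 1, D))       # one learnable [CLS] vector
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))   # one learnable vector per position

# z_0 = [x_class; x_p E] + E_pos
z0 = torch.cat([cls_token.expand(B, -1, -1), patch_tokens], dim=1) + pos_embed
print(z0.shape)                                      # torch.Size([2, 197, 768])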

This is one of the few places where the modern decoder-only LLM stack and the classic ViT stack made different choices and stuck with them. Modern language models almost universally now use rotary position encoding (the topic of Chapter 2), which is cheaper to extend to longer contexts and behaves better at extrapolation. Classic ViT and most of its 2020-2022 descendants stayed with learnable absolute position embeddings, which are a fine match for fixed input resolutions and a poor match for variable resolutions. A practical consequence is that ViT fine-tuning at higher resolution requires a 2-D interpolation of the pretrained position embeddings according to their location in the original image; this is the only place where the 2-D structure of the image is manually re-injected into the model after training [src_013].
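The interpolation step is short in code. The sketch below follows the recipe common in open-source implementations (bicubic resampling of the patch position embeddings on their 2-D grid, with the [CLS] slot left untouched); the function name and defaults are illustrative rather than taken from the paper:

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid=14, new_grid=24):
    # pos_embed: (1, 1 + old_grid**2, D); slot 0 is the [CLS] position.
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]
    D = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)   # (1, D, g, g)
    patch_pe = F.interpolate(patch_pe, size=(new_grid, new_grid),
                             mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pe, patch_pe], dim=1)

# ViT-B/16 fine-tuned at 384 x 384: the 14 x 14 grid becomes a 24 x 24 grid.
new_pe = interpolate_pos_embed(torch.randn(1, 197, 768), old_grid=14, new_grid=24)
print(new_pe.shape)   # torch.Size([1, 577, 768])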

4. Token count arithmetic

The patch-embedding formula \(N = HW/P^2\) has direct numerical consequences whenever a specific ViT variant is named. The most common configuration in the literature is ViT-B/16: a Base-sized model (12 layers, \(D = 768\), 12 heads, 86M parameters per Table 1 of the ViT paper) with patch size \(P = 16\), trained at the standard \(224 \times 224\) input resolution [src_013].

Plugging the numbers in: \(H = W = 224\), \(P = 16\), so \(N = (224/16)^2 = 14^2 = 196\) patches.

🤔 Pause and reflect

Before reading on, predict — for ViT-L/14 at the same \(224 \times 224\) input, what is the patch count \(N\) and the total token count \(N + 1\)? Then predict how the per-block self-attention cost scales with the patch count \(N\), and what that implies when the input side length doubles. (Do not look ahead — write the answer down or say it out loud.)

Adding the [CLS] token brings the total token count to \(N + 1 = 197\), which is the sequence length the encoder actually sees. The ViT paper itself states the formula \(N = HW/P^2\) and notes the \(14 \times 14\) grid (against the original \(224 \times 224\) image) in its position-embedding ablation, but does not write the explicit "197 tokens" arithmetic in a single sentence; the worked example here makes it concrete [src_013].

This concrete sequence length is worth carrying in mind. ViT-B/16 at \(224 \times 224\) sees a sequence of length 197, much shorter than the multi-thousand-token contexts familiar from language models, and the per-block self-attention cost — quadratic in token count — is correspondingly modest. ViT-L/16 has the same sequence length (the model dimension and depth grow, but \(N\) is fixed by the image and patch sizes); ViT-L/14, the variant favoured by many self-supervised pretraining recipes, has \(N = (224/14)^2 = 16^2 = 256\) patches and so 257 tokens. Going to higher resolutions explodes the count quickly: ViT-B/16 at \(384 \times 384\) has \(N = 24^2 = 576\) patches and 577 tokens, roughly a \(3\times\) increase in token count and, because attention cost is quadratic in token count, a \((576/196)^2 \approx 8.6\times\) blow-up in per-block attention cost relative to 224.
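The arithmetic is mechanical enough to script. The few lines below reproduce the counts quoted in this section together with the attention-cost ratio; the helper function is illustrative, not a library call:

def vit_tokens(image_size, patch_size):
    n = (image_size // patch_size) ** 2      # N = HW / P^2 for square inputs
    return n, n + 1                          # patches, patches + [CLS]

for name, size, p in [("ViT-B/16 @ 224", 224, 16),
                      ("ViT-L/14 @ 224", 224, 14),
                      ("ViT-B/16 @ 384", 384, 16)]:
    n, total = vit_tokens(size, p)
    print(f"{name}: N = {n}, tokens = {total}")
# ViT-B/16 @ 224: N = 196, tokens = 197
# ViT-L/14 @ 224: N = 256, tokens = 257
# ViT-B/16 @ 384: N = 576, tokens = 577

print(f"attention-cost ratio, 384 vs 224: {(576 / 196) ** 2:.1f}x")   # ~8.6x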

💡 Key result

ViT-B/16 at \(224 \times 224\) sees \(197\) tokens; the per-block self-attention cost grows quadratically in the number of patches, so doubling the input side length quadruples the patch count and multiplies the attention cost roughly sixteenfold.

5. The encoder is the encoder of Chapter 1

With patch tokens, [CLS] token and position embeddings in hand, the rest of ViT is exactly the Transformer encoder of Chapter 1 [src_013].

🔗 Connection

The encoder block reused here — pre-LN ordering, multi-head self-attention, two-layer MLP, two residual connections — is defined in Chapter 1; the rest of this chapter assumes that as starting context.

Equations (2)–(4) of the ViT paper are reproduced below, in the paper's own notation, for completeness:

\[z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1, \ldots, L,\]
\[z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1, \ldots, L,\]
\[y = \mathrm{LN}(z_0^L).\]

🎯 Intuition

The 2020 ViT block and the 2024 modern-ViT block differ only in their drop-in component choices:

Component 2020 ViT 2024 modern ViT (Llama-style)
Norm LayerNorm, pre-LN RMSNorm, pre-LN
FFN nonlinearity GELU MLP SwiGLU FFN
Position Learnable absolute (1-D) (Optional) RoPE-2D / axial RoPE

The encoder block is unchanged; only the three components above are swapped, exactly as the language-model side evolved from GPT-2 to Llama.

Reading these in book notation: each block applies a LayerNorm, a multi-head self-attention, and a residual connection; then another LayerNorm, a two-layer MLP with a GELU non-linearity, and another residual connection [src_013]. LayerNorm is placed before each sub-layer rather than after, the now-standard pre-LN configuration that Chapter 1 introduced as a stability fix over the post-LN of the original 2017 Transformer. The classification head reads off \(z_0^L\), the [CLS] token's hidden state at the output of the final block, after a final LayerNorm.
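For concreteness, eqs. (2)–(3) can be written directly against stock PyTorch modules. This is a pedagogical sketch (dropout, stochastic depth, and the fused qkv projection layout of production implementations are omitted), not the reference ViT code:

import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    # Minimal pre-LN encoder block in the spirit of ViT eqs. (2)-(3).
    def __init__(self, d=768, heads=12, mlp_ratio=4.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, int(d * mlp_ratio)),
                                 nn.GELU(),
                                 nn.Linear(int(d * mlp_ratio), d))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]   # z' = MSA(LN(z)) + z
        z = z + self.mlp(self.ln2(z))                        # z  = MLP(LN(z')) + z'
        return z

# 12 such blocks make a ViT-Base encoder; y = LN(z[:, 0]) feeds the classification head.
encoder = nn.Sequential(*[ViTBlock() for _ in range(12)])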

🤔 Pause and reflect

Why did pre-LN replace post-LN in the modern Transformer block? What stability problem did the post-LN architecture exhibit at depth that pre-LN solves, and how does the residual stream pass through each formulation differently? (Do not look ahead — recall the answer from Chapter 1, or check the cross-reference below.)

Two consequences follow. First, all of the modern-component upgrades that Chapters 2–4 cover — RoPE, RMSNorm, SwiGLU/GeGLU, FlashAttention — are drop-in replacements at the ViT-block level just as they are at the language-model-block level. Modern ViT variants in 2026 routinely swap LayerNorm for RMSNorm and the GELU MLP for a SwiGLU FFN; the encoder block remains a residual-stream Transformer block, and the difference from the 2020 ViT is exactly the difference between a 2020 GPT-2 block and a 2024 Llama block. Second, all the analysis tools that apply to language Transformers (residual-stream linearity, attention-pattern visualisation, mean attention distance) also apply to ViT, and Section 4.5 of the original paper exploits the latter two explicitly: some heads attend globally even at the lowest layers, while others stay highly localised, and the localised low-layer heads behave like the early convolutional layers of a CNN [src_013].
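To make the drop-in claim concrete, here is a sketch of a SwiGLU feed-forward module of the kind modern recipes substitute for the GELU MLP inside each encoder block; the class name and hidden size are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    # Gated feed-forward block: down(silu(gate(x)) * up(x)).
    def __init__(self, d=768, hidden=2048):
        super().__init__()
        self.w_gate = nn.Linear(d, hidden, bias=False)
        self.w_up = nn.Linear(d, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Drop-in: use SwiGLUFFN in place of the two-layer GELU MLP; RMSNorm similarly replaces LayerNorm.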

🔄 Recap

  • Complete: Write down the token count \(N + 1\) for ViT-B/16 at \(384 \times 384\) input resolution.
  • Explain: Why does ViT use 1-D learnable position embeddings rather than 2-D-aware variants (sin-cos along \(H\) and \(W\), or factorised row/column embeddings)?
  • Compare: How does the encoder block of ViT relate to the Chapter 1 Transformer encoder? Name what is preserved (architecture, pre-LN, residual connections) and what is added on top (patch front end, [CLS] token, learnable absolute position embedding).

6. The data-hunger result

The ViT paper's experimental claim is not that the architecture is novel; nearly everything about it had appeared somewhere in the literature already. The claim is that scale — pretraining data plus pretraining compute — is what makes pure Transformers competitive on vision, and that with enough scale they are not just competitive but better.

The headline numbers from Section 4.2 of the paper [src_013]: ViT-L/16 pretrained on JFT-300M reaches 87.76% top-1 on ImageNet at fine-tuning, against 87.54% for BiT-L (Big Transfer; a very strong ResNet152x4 baseline pretrained on the same JFT-300M dataset, Kolesnikov et al. 2020), while requiring substantially less pretraining compute (0.68k TPUv3-core-days against 9.9k for BiT-L). The larger ViT-H/14 reaches 88.55% on ImageNet at 2.5k TPUv3-core-days, beating Noisy-Student EfficientNet-L2 and BiT-L on ImageNet, CIFAR-100, and the 19-task VTAB suite [src_013].

The flip side, made explicit in Section 4.3 [src_013]: when pretraining is restricted to ImageNet-1k alone, ViT-Large underperforms ViT-Base, and both underperform comparable ResNet baselines.

🎯 Intuition

With enough data, the variance the bias would have controlled is controlled by the data instead, and the bias's inflexibility now costs more than it saves. At small data, the CNN's locality bias is a free regulariser; at large data, it is a constraint that prevents the model from learning patterns the data already supplies.

The paper's interpretation is that ViT carries less image-specific inductive bias than a CNN — no built-in locality outside the patch grid, no translation equivariance outside the MLPs — so on smaller datasets it overfits where a CNN would generalise [src_013]. Moving pretraining to ImageNet-21k (≈14M images) closes the gap, and only with JFT-300M do larger ViT models manifest their full potential [src_013]. Figure 4 of the paper plots this as a phase transition: with under ~30M JFT pretraining samples, BiT ResNets dominate; past ~100M, ViT-L/16 overtakes [src_013].

The honest pedagogical reading is that ViT's win was not architectural elegance but a systems result: the architecture is amenable to scaling because it borrows the NLP Transformer's accelerator-friendly compute profile, and at scale the inductive-bias deficit becomes a virtue rather than a liability. This story sits behind a great deal of the post-2020 vision research programme, including the self-supervised methods of Chapter 6.

💡 Key result

At sufficient pretraining scale (JFT-300M and beyond), pure Transformers match or exceed the strongest convolutional baselines on image recognition while using less pretraining compute, but at smaller pretraining scales the inductive-bias deficit hurts and CNNs win.

7. DeiT and data efficiency

The data-hunger result raised an immediate practical question: can ViT-class models be trained on academia-sized image budgets, namely ImageNet-1k alone, without the 300M-image private corpora of the original paper? The DeiT (Data-efficient image Transformers) recipe of Touvron et al. is the most influential answer [src_050]. DeiT pairs a ViT-Base architecture with strong augmentation (RandAugment, Mixup, CutMix, random erasing: four standard image-augmentation strategies whose details are outside the scope of this chapter) and a 300-epoch training schedule on ImageNet-1k alone. It also adds a distillation token, a second register-style token (a non-image learnable token that the encoder can write into; the [CLS] token is the canonical example) prepended alongside [CLS], which learns to match the predictions of a CNN teacher through attention; the default teacher in the paper is RegNetY-16GF (84M parameters, 82.9% top-1) [src_050]. DeiT-B reaches 83.1% top-1 without distillation and 85.2% with it, training on a single node in two to three days, which makes the resulting model competitive with strong convnet baselines at lower pretraining compute [src_050].
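The distillation objective itself is small. Below is a simplified sketch of the hard-label variant described in the DeiT paper (the paper also evaluates a soft, KL-based variant); the function name and argument layout are illustrative, not the authors' code:

import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    # cls_logits:  head on the [CLS] token, trained against the ground-truth labels
    # dist_logits: head on the distillation token, trained against the teacher's hard decisions
    teacher_labels = teacher_logits.argmax(dim=-1)
    return 0.5 * F.cross_entropy(cls_logits, labels) + \
           0.5 * F.cross_entropy(dist_logits, teacher_labels)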

DeiT matters here because it is the first widely-adopted demonstration that the ViT architecture is not intrinsically data-hungry; the data-hunger result is about a particular training recipe. With enough augmentation and a CNN teacher to distill from, the same ViT-B that fails on ImageNet-1k from scratch trains successfully. This loosened the expectation that JFT-scale corpora are mandatory for vision Transformers and set the stage for the self-supervised pretraining methods of Chapter 6 — MAE, DINOv2 — which dropped the CNN teacher entirely.

🔗 Connection

Chapter 6 picks up the self-supervised pretraining recipes — MAE (masked-patch autoencoders), DINOv2 (self-distillation without text supervision), CLIP / SigLIP (image-text contrastive), and SAM-2 (segmentation foundation model) — that follow naturally from the data-hunger / data-efficiency arc this chapter just developed.

8. Swin Transformer V2: convolutional priors return

ViT bet that scaling data could substitute for inductive bias. A parallel research line bet the opposite: that re-introducing convolution-style structure into the Transformer encoder would pay off on dense prediction tasks (object detection, semantic segmentation) and at high resolutions, where vanilla ViT's quadratic-cost global attention becomes prohibitive. Liu et al.'s Swin Transformer is the canonical instance of that line, and Swin Transformer V2 is its scaled-up successor [src_042].

Swin's two architectural moves are hierarchy and locality. Instead of ViT's flat sequence of \(N\) patches at a single resolution, Swin builds a feature pyramid: tokens are merged across stages so that the spatial resolution halves and the channel dimension doubles, mimicking a ResNet's stage-wise downsampling. Within each stage, self-attention is computed only inside non-overlapping windows of fixed size, and consecutive blocks alternate between regular and shifted-window partitions so that information eventually propagates between windows.

🎯 Intuition

Imagine a \(4 \times 4\) grid of windows tiling the image in block \(N\); in block \(N+1\) the grid is shifted by half a window in each dimension, so each new window straddles four old ones. After two consecutive blocks every window has "seen" some information from each of its diagonal neighbours, even though no individual block computes attention across windows.

The vanilla Transformer encoder is recovered as the limit of one window covering the whole image; Swin's window attention is the engineering compromise that gives back locality and translation equivariance to the inductive-bias budget [src_042].
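A sketch of the windowing mechanics makes the cost argument concrete: attending within \(M \times M\) windows costs \(O(HW \cdot M^2)\) per layer instead of the \(O((HW)^2)\) of global attention over the same token grid. The function below is illustrative only; real Swin code additionally masks attention across the wrap-around boundary introduced by the cyclic shift:

import torch

def window_partition(x, window_size, shift=0):
    # x: (B, H, W, C) grid of tokens. Returns (num_windows * B, window_size**2, C);
    # self-attention is then computed independently inside each window.
    if shift:
        # cyclic shift; consecutive blocks alternate shift = 0 and shift = window_size // 2
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

windows = window_partition(torch.randn(2, 56, 56, 96), window_size=7, shift=3)
print(windows.shape)   # torch.Size([128, 49, 96])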

Swin V2 specifically targets scaling. Three changes carry the weight: a residual-post-norm configuration combined with scaled cosine attention to control activation amplitudes at depth (large Swin V1 models exhibited the activation-magnitude growth across layers that motivates many post-norm rescues in the literature); a log-spaced continuous position bias that makes pretrained models transferable across window sizes without the bicubic-interpolation hack; and a SimMIM self-supervised pretraining stage that reduces the labelled-data requirement [src_042]. The aggregate result is a 3-billion-parameter dense vision model trained at \(1{,}536 \times 1{,}536\) resolution, which the authors describe as the largest dense vision model of its time and which set then-current records on ImageNet-V2 classification, COCO detection, ADE20K segmentation, and Kinetics-400 action recognition [src_042]. Pedagogically, Swin V2 is best read as the demonstration that the ViT bet is not the only point on the design curve: re-introducing structure pays off at fine-grained recognition and high resolution, where the data-bias trade-off shifts.

🔄 Recap

  • Explain: What mechanism makes the data-trumps-bias result kick in at JFT-300M scale and not at ImageNet-1k? Use the bias-variance trade-off in your answer.
  • Predict: Given a 14M-image ImageNet-21k corpus and ViT-Base, would DeiT-style augmentation (RandAugment, Mixup, CutMix, random erasing, distillation-token) still be necessary to match a CNN baseline at this corpus size?
  • Compare: For a high-resolution dense-prediction task (semantic segmentation at \(1{,}024 \times 1{,}024\)), would you prefer plain ViT-L/14 or a Swin V2 encoder, and why? Frame the answer in terms of attention-cost asymmetry and inductive-bias supply.

9. ViT in 2026

Five years on, the picture has clarified. Plain ViT — patch embedding, [CLS] token, learnable absolute position embedding, vanilla Transformer encoder — remains the dominant CV backbone for self-supervised pretraining and for the vision side of multimodal models, and the architectural variants that re-introduce convolutional structure occupy a complementary niche rather than displacing it. Courant et al.'s 2023 survey treats this convergence as the steady state of the field: ViT-class encoders have absorbed most of the research momentum on representation learning, while Swin-class encoders retain an advantage on high-resolution dense prediction [src_009].

The drivers of plain ViT's persistence are pragmatic. Self-supervised methods that operate by patch masking, in particular MAE (the topic of Chapter 6), are easiest to formulate over a flat patch sequence; the asymmetric encoder–decoder design that makes MAE efficient relies on the encoder seeing only a subset of tokens, which a hierarchical Swin-style backbone makes harder. Multimodal vision-language models — SigLIP, CLIP, and the vision encoders used inside the open-weights VLMs of 2024–2026 — almost universally tokenize an image as a flat patch sequence and feed the [CLS] token or a pooled set of patch tokens into a text-aligned projection. And at the frontier of pretraining-data scaling, plain ViT-L/14 or ViT-G/14 with bigger pretraining corpora consistently beats clever architectural variants trained on smaller corpora; the data-trumps-bias story of the original paper has, if anything, strengthened with each subsequent year.

🔗 Connection

CLIP, SigLIP, and the open-weights VLMs of 2024–2026 named here are developed in Chapter 6; this chapter uses them only as exemplars of the plain-ViT-as-vision-encoder pattern. VLMs (vision-language models) themselves are out of scope for this book.

Two qualifications. First, the encoder block in modern plain ViT is no longer the 2020 LayerNorm-and-GELU block; modern recipes routinely swap LayerNorm for RMSNorm and the GELU FFN for SwiGLU, exactly as in modern decoder-only LLMs. The patch front end, [CLS] token, learnable absolute position embedding, and overall residual-stream skeleton are unchanged. Second, the position-embedding gap between ViT and decoder-only LLMs has begun to close from the LLM side: rotary variants for vision exist (RoPE-2D, axial RoPE) and are slowly displacing learnable absolute embeddings in research code, particularly when models must operate at variable resolutions.

⚠️ Pitfall

RoPE-2D and axial RoPE are 2-D extensions of rotary position encoding (Chapter 2). They are not covered in detail anywhere in this book; the Β§9 mention is a forward pointer, not a definition. A reader implementing them today should consult the relevant primary papers.

As of 2026 this transition is not yet universal, and a baseline ViT implementation can still safely use the original learnable scheme.

⚠️ Pitfall

When this chapter says "plain ViT" it means the patch-front-end / [CLS] / learnable absolute position embedding / residual-stream skeleton β€” NOT necessarily the 2020 LayerNorm-and-GELU encoder block. In modern (2024–2026) practice the encoder block is RMSNorm + SwiGLU + (optionally) RoPE-2D, exactly as in modern decoder-only LLMs.

10. Summary and forward pointer

Vision Transformers turn an image into a token sequence by chopping it into non-overlapping patches, linearly projecting each patch to a \(D\)-dimensional vector, prepending a learnable [CLS] token, and adding a learnable absolute position embedding. The patch embedding is mathematically equivalent to a single Conv2d with kernel size and stride both equal to the patch size \(P\), and that equivalence is what production code uses. The encoder downstream is the Transformer encoder of Chapter 1 with no vision-specific modifications.

The original ViT paper's substantive claim was empirical: at sufficient pretraining scale (JFT-300M-class corpora), pure Transformers match or exceed the strongest convolutional baselines on image recognition while using less pretraining compute, despite carrying less image-specific inductive bias [src_013]. At smaller pretraining scales the bias gap hurts and CNNs win; the data-hunger result is the load-bearing pedagogical claim, not architectural elegance. DeiT later showed that the data requirement was an artefact of the supervised-pretraining recipe rather than the architecture, and Swin Transformer V2 showed that re-introducing convolutional priors pays off at high resolution and on dense prediction tasks [src_042].

Chapter 6 picks up where this chapter ends. Once the architecture is settled, the question is how to pretrain it without labels. MAE rebuilds the BERT-style masked-prediction recipe over patches. DINOv2 trains discriminatively without text supervision and matches CLIP-class features. SigLIP replaces softmax contrastive with a sigmoid loss and decouples the loss from the global batch. SAM-2 takes a ViT-class encoder and turns it into a video segmentation foundation model. All four sit on top of the architecture this chapter built.

🔄 Recap

  • Complete: In one sentence, state what is preserved and what is swapped between a 2020 plain ViT and a 2026 modern plain ViT.
  • Explain: Why does plain ViT remain the dominant CV backbone for self-supervised pretraining and the vision side of multimodal models, despite Swin's advantages on dense-prediction tasks at high resolution?
  • Predict: If a research team in 2026 picks plain ViT-L/14 over Swin V2 for a new SSL pretraining run on a 1B-image corpus, what is the most likely reason — list at least two of the pragmatic drivers the chapter named.
  • Generate: In your own words, write one paragraph stating the chapter's central claim about ViT — covering the patch front end / Ch.1 encoder reuse / data-hunger / DeiT / Swin counterpoint and the modern-ViT-block scoping.

References

  • [src_001] Bishop, C. M., & Bishop, H. (2024). Deep Learning: Foundations and Concepts. Springer. https://www.bishopbook.com/
  • [src_008] Torralba, A., Isola, P., & Freeman, W. T. (2024). Foundations of Computer Vision. MIT Press. https://visionbook.mit.edu/
  • [src_009] Courant, R., Edberg, M., Dufour, N., & Kalogeiton, V. (2023). Transformers and Visual Transformers. In Machine Learning for Brain Disorders (Humana/Springer). https://doi.org/10.1007/978-1-0716-3195-9_6
  • [src_013] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929. https://arxiv.org/pdf/2010.11929
  • [src_042] Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., & Guo, B. (2022). Swin Transformer V2: Scaling Up Capacity and Resolution. arXiv:2111.09883. https://arxiv.org/pdf/2111.09883
  • [src_048] Krig, S. (2025). Computer Vision Metrics — Ch. 11: Attention, Transformers, Hybrids, and DDNs. Springer. Survey supplement; cited as a taxonomy pointer for further reading. https://doi.org/10.1007/978-981-99-3393-8_11
  • [src_050] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. arXiv:2012.12877 (ICML 2021). https://arxiv.org/pdf/2012.12877