Self-Supervised Vision

The Vision Transformer (ViT) chapter showed that a Transformer encoder, fed flat patch embeddings, can match or exceed convolutional networks at scale. It also showed the cost of that bet: ViT carries no built-in locality or translation-equivariance prior, so it depends on enormous labelled datasets — JFT-300M in the original paper — to extract every drop of its capacity. This chapter is about the natural question that follows: if a ViT needs that much data to reach its capacity, can the data be supplied without labels?

The answer, in 2024–2026, is essentially yes. The strongest publicly available visual backbones used inside vision-language models (VLMs), inside open-vocabulary segmentation systems, inside image retrieval pipelines, and inside dense-prediction stacks are not supervised classifiers; they are self-supervised features. Four of those backbones — MAE, DINOv2, SigLIP, and SAM-2 — anchor this chapter. Each represents a different route to label-free or weakly-labelled visual features, and each is a working example of how the ViT bet pays off when paired with a self-supervised pretext task.

1. Why self-supervision in vision

The argument for self-supervised vision was originally an argument by analogy with NLP. BERT and GPT showed that masking or autoregressively predicting tokens of natural language — pretext tasks with no human labels in the loop — produces representations that transfer to nearly every downstream task. Convolutional grids resisted the same trick for years: they did not naturally accept BERT-style mask tokens, image signals are spatially redundant in a way text is not, and image decoders reconstruct pixels rather than semantically rich tokens [src_038]. The introduction of ViT removed the architectural obstacle: a patch sequence is a sequence of tokens, and mask tokens can be inserted just as in BERT [src_038]. The information-density obstacle — that a missing patch can usually be filled by interpolating from neighbours — turned out to be addressable by simply masking far more aggressively than BERT does. The semantic-level obstacle — pixels are a low-level reconstruction target — turned out to be addressable by changing the encoder/decoder split rather than the loss.

🔗 Connection

The BERT analogy this section leans on is developed in Chapter 7 (encoder-decoder family). BERT pretrains a Transformer encoder by masking ~15% of input tokens and reconstructing them via masked language modelling (MLM); MAE adapts the same idea to images, but with a 75% masking ratio because image patches are spatially redundant in a way text tokens are not.

The Torralba/Isola/Freeman text frames self-supervised representation learning as one of the central organising ideas of modern computer vision, sitting alongside detection, segmentation, and generation as a backbone-producing technology rather than a task in its own right [src_008]. The Courant et al. survey, treating "training without labels" as one of the principal extension axes of visual transformers, takes the same view from the transformer side [src_009]. A reader who learns supervised ImageNet pretraining as the default vision recipe is, in 2026, learning a recipe that has been displaced. The features that fine-tune best, the features that retrieve best frozen, and the features that connect to language models all come out of self-supervised pretraining.

🔗 Connection

Chapter 5 (Vision Transformers) develops ViT as a patch-sequence Transformer encoder; this chapter takes that encoder as a given and replaces the labelled-classification pretext task with self-supervised alternatives. Every reference to "patch sequence" or "ViT tokenisation" below assumes the §2 / §4 framing of Chapter 5.

2. The MAE recipe

Masked Autoencoders (MAE) are the cleanest BERT analogue in vision [src_038]. The recipe has three moving parts. First, the input image is split into non-overlapping patches following the standard ViT tokenisation. Second, a random subset of the patches is masked — and crucially, the masking ratio is high. The default in the MAE paper is 75%, with strong fine-tuning accuracy across the range of 40% to 80% [src_038]. This is in sharp contrast with BERT's typical 15% masking ratio for text. Third, the encoder consumes only the visible 25% of patches; a separate, lightweight decoder receives the encoded visible tokens together with shared learned mask tokens at the missing positions, and reconstructs pixel values for the masked patches. The reconstruction loss is mean squared error in pixel space, computed only on the masked patches [src_038].

The asymmetric encoder-decoder design is the engineering win. The encoder is the part that will be deployed downstream, so it is heavy: a full ViT-Large or ViT-Huge. The decoder exists only at pretraining time and is intentionally small — the default is 8 Transformer blocks of width 512, contributing roughly 9% of the encoder FLOPs per token [src_038]. Because mask tokens enter only at the decoder, the encoder runs on a quarter of the sequence length. The result is a wall-clock speedup of \(3\times\) or more relative to a symmetric design that puts mask tokens through the encoder, and a corresponding reduction in memory that lets MAE scale to ViT-Huge on ImageNet-1k alone [src_038].

The headline transfer number is that a ViT-Huge pretrained with MAE on ImageNet-1k and then fine-tuned at 448-resolution reaches 87.8% top-1 on ImageNet-1k validation, the strongest result among methods using only ImageNet-1k data at that time [src_038]. The earlier supervised recipe needed JFT-300M to extract the same accuracy. Pretraining without labels, on the same image collection, beat training with labels.

A minimal pseudocode sketch of the masking step makes the implementation feel concrete:

import torch

def mae_mask(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Random patch masking for MAE. tokens has shape (B, N, D)."""
    B, N, D = tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)   # one uniform score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # random per-image permutation
    ids_keep = ids_shuffle[:, :n_keep]               # first n_keep ranks stay visible
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation for the decoder
    visible = torch.gather(
        tokens, dim=1,
        index=ids_keep.unsqueeze(-1).expand(-1, -1, D),
    )
    return visible, ids_restore

The logic is per-image random shuffling: take the first n_keep shuffled indices as the visible set, and remember the inverse permutation so the decoder can place mask tokens back at the right positions. No specialised sparse operator is needed, and the encoder simply runs on visible [src_038].
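
The decoder-side counterpart is equally simple. Below is a minimal sketch of how the encoded visible tokens and a shared learned mask token are recombined before the lightweight decoder runs; the helper name build_decoder_input and the omission of the class token are simplifications for illustration, not the paper's released code:

import torch

def build_decoder_input(encoded_visible: torch.Tensor,
                        ids_restore: torch.Tensor,
                        mask_token: torch.Tensor) -> torch.Tensor:
    """Re-insert a shared mask token at the masked positions.
    encoded_visible: (B, n_keep, D) encoder outputs for the visible patches.
    ids_restore: (B, N) inverse permutation returned by mae_mask.
    mask_token: (1, 1, D) learned embedding shared by every masked position."""
    B, n_keep, D = encoded_visible.shape
    N = ids_restore.shape[1]
    # Pad the sequence back to full length with copies of the mask token ...
    mask_tokens = mask_token.expand(B, N - n_keep, D)
    full = torch.cat([encoded_visible, mask_tokens], dim=1)  # still in shuffled order
    # ... then undo the shuffle so each token sits at its original patch position.
    full = torch.gather(full, dim=1,
                        index=ids_restore.unsqueeze(-1).expand(-1, -1, D))
    return full  # (B, N, D): add positional embeddings, then run the decoder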

💡 Key result

MAE pretraining on ImageNet-1k alone, without labels, lets a ViT-Huge at 448-resolution beat the strongest supervised recipe of its era — which needed JFT-300M to reach the same accuracy.

3. Why MAE works

🎯 Intuition

The mechanism §3 will defend in three observations is shape over texture: at high masking ratios, the only way to fill the holes is to internalise object shape, gestalt, and global structure, because local texture cues are gone. The three observations below decompose this single thesis into a finding about masking ratio, a finding about mask-token placement, and a finding about reconstruction target.

The pixel-MSE loss looks like a low-level objective; if it were the whole story, the encoder would learn texture statistics and little more. The fact that an MAE encoder transfers strongly to fine-tuning and to dense-prediction tasks means something more is happening, and the MAE ablations make the mechanism legible.

🔗 Connection

Linear probing freezes the encoder and trains a single linear classifier on top; fine-tuning unfreezes the encoder. Chapter 5 (Vision Transformers) discusses both, and the §8 contrast in this chapter (MAE strongest after fine-tuning, DINOv2 strongest frozen) depends on the distinction.

The first observation is that the masking ratio is the key hyperparameter. At low ratios — closer to BERT's 15% — a missing patch can be reconstructed by extrapolating from its neighbours [src_038]. At a ratio of 75% only a handful of patches survive, and the reconstruction task can no longer be reduced to local interpolation. The MAE paper shows ImageNet-1k linear probe accuracy climbing from 54.6% at 10% masking to 73.5% at 75% masking, and dropping again at extreme 90% ratios [src_038].

🤔 Pause and reflect

At 90% masking, only ~10% of patches remain visible. Why does linear-probe accuracy drop at this extreme rather than continuing to rise? What changes about the reconstruction task when too few patches survive? (Do not look ahead — write the answer down or say it out loud.)

The high-ratio task forces the encoder to represent shape, gist, and the gestalt of objects rather than texture, because nothing else suffices to fill the holes.

The second observation is that the mask token must not enter the encoder. An MAE variant that feeds mask tokens through the encoder loses 14% in linear probing, because the encoder's input distribution at pretraining time then differs sharply from its input distribution at deployment time, where every patch is real [src_038]. Removing the mask token from the encoder eliminates this train-test distribution gap and is, simultaneously, the source of the wall-clock speedup. The asymmetric design and the high mask ratio reinforce each other: the mask ratio creates a hard reconstruction task, and the asymmetry keeps the encoder cheap enough that the hard task can be trained for many epochs.

A third observation, useful for cross-referencing the encoder/decoder taxonomy treated elsewhere in this book, is that the reconstruction target choice is not particularly sharp. Normalised pixels — pixel values shifted and scaled by per-patch mean and standard deviation — outperform unnormalised pixels by roughly 0.5%, and dVAE tokens (the BEiT-style discrete target) match normalised pixels statistically [src_038]. The simpler recipe wins by a hair because dVAE tokens require an additional pretraining stage on a separate image corpus. The structural lesson is that the encoder-decoder split, not the choice of reconstruction target, carries the load.
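
The per-patch normalisation of the winning target is one line of arithmetic. A minimal sketch, assuming the target patches are already flattened to shape (B, N, P) with P pixel values per patch; the epsilon is an illustrative choice:

import torch

def normalised_pixel_target(patches: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-patch normalised reconstruction target. patches has shape (B, N, P)."""
    mean = patches.mean(dim=-1, keepdim=True)     # per-patch mean over pixel values
    var = patches.var(dim=-1, keepdim=True)       # per-patch variance
    # The MSE loss is computed against this target, on the masked patches only.
    return (patches - mean) / (var + eps).sqrt()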

🔄 Recap

  • Complete: MAE's default masking ratio is _____, and at the extreme of 90% the linear-probe accuracy _____ rather than continuing to rise.
  • Explain: Why must mask tokens not enter the encoder? Name the train-test distribution argument and the wall-clock-speedup consequence in one sentence each.
  • Predict: Given normalised pixels, unnormalised pixels, and dVAE tokens as candidate reconstruction targets — which wins, by how much, and why is the gap small enough that the §3 prose calls the choice "not particularly sharp"?

4. DINOv2

MAE proves that pixel-level reconstruction works as a pretext task at scale. DINOv2 makes the parallel point for the discriminative family: a student-teacher self-distillation objective, with no text and no human labels, can produce general-purpose visual features that match or surpass features trained with text supervision [src_039].

The DINOv2 training loop has three load-bearing components [src_039].

🎯 Intuition

Before the technical names land, here is what each load-bearing piece is actually doing.

| Component | What it does | Plain-language picture |
| --- | --- | --- |
| DINO image-level loss | Whole-image self-distillation | Student matches the teacher's prediction over learned prototypes for the same image |
| iBOT patch-level loss | Masked-patch self-distillation | Student's masked-patch features must match the teacher's unmasked-patch features |
| EMA teacher | Slow-moving snapshot of the student | The teacher is the student's own past — no external label source |

The three further stabilisers (Sinkhorn-Knopp, KoLeo, high-resolution adaptation) keep this loop from collapsing.

An image-level loss derived from DINO (the 2021 self-distillation predecessor of DINOv2, with the same student-teacher cross-entropy-over-prototypes structure but on a smaller corpus) computes the cross-entropy between student and teacher predictions over a learned set of prototypes, with each network seeing a different crop (multi-crop augmentation) of the same image. A patch-level loss derived from iBOT (the 2022 masked-image-modelling counterpart of DINO; same self-distillation structure applied to masked patch tokens rather than whole-image views) does the same thing at the patch token level, with masked patches in the student matched against unmasked patches in the teacher. The teacher is built from an exponential moving average of the student parameters; this is what makes the procedure self-distillation rather than supervised distillation. Three further stabilisers — Sinkhorn-Knopp centring of the teacher outputs, a KoLeo regulariser that spreads features uniformly across the batch, and a short high-resolution adaptation phase — let the procedure scale to a billion-parameter ViT on a hundred-million-image dataset [src_039].
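
A stripped-down sketch of the two mechanisms that make this self-distillation rather than supervised distillation: the EMA teacher update and the cross-entropy over prototype scores. The momentum and temperature values are illustrative, and the sketch omits multi-crop pairing, Sinkhorn-Knopp centring, the KoLeo regulariser, and the iBOT patch-level term:

import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.994) -> None:
    """The teacher is an exponential moving average of the student's own weights."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)

def self_distillation_loss(student_scores: torch.Tensor,
                           teacher_scores: torch.Tensor,
                           t_student: float = 0.1,
                           t_teacher: float = 0.04) -> torch.Tensor:
    """Cross-entropy between student predictions and sharpened teacher predictions
    over K learned prototypes. Both inputs have shape (B, K)."""
    teacher_probs = F.softmax(teacher_scores / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_scores / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()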

The data pipeline is, honestly, much of the story. DINOv2 is trained on LVD-142M, a corpus of 142 million curated images assembled from a 1.2-billion-image uncurated pool by an automatic deduplication and retrieval procedure [src_039]. The retrieval step embeds curated reference images with a self-supervised ViT-H/16 and pulls nearest neighbours from the uncurated pool. There is no metadata, no caption text, and no human label inside the loop. Label-free or not, curation still matters: in the paper's ablation, training on 142 million uncurated images instead of the curated LVD-142M trails on most benchmarks, sometimes by several points [src_039]. The lesson is that data curation is not free; it is part of the recipe, and the engineering investment is real.
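
At its core, the retrieval step is cosine-similarity nearest-neighbour search in embedding space. A minimal sketch under that reading; the real LVD-142M pipeline also deduplicates the pool and runs the search with an approximate-nearest-neighbour index rather than a dense matrix product:

import torch
import torch.nn.functional as F

def retrieve_neighbours(curated_emb: torch.Tensor,
                        uncurated_emb: torch.Tensor,
                        k: int = 4) -> torch.Tensor:
    """For each curated reference image, pull the indices of its k nearest
    neighbours from the uncurated pool. Embeddings come from a frozen
    self-supervised ViT."""
    curated = F.normalize(curated_emb, dim=-1)      # (C, D)
    uncurated = F.normalize(uncurated_emb, dim=-1)  # (U, D)
    sims = curated @ uncurated.T                    # (C, U) cosine similarities
    return sims.topk(k, dim=-1).indices             # (C, k) indices into the pool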

The empirical claim that anchors the chapter is that frozen DINOv2 features rival or beat OpenCLIP (the CLIP recipe — softmax over the \(|B| \times |B|\) similarity matrix, all-gather across devices — is examined in §5) across a wide range of benchmarks — image classification, instance retrieval, semantic segmentation, monocular depth — without ever seeing a caption [src_039]. A ViT-g/14 trained with DINOv2 on LVD-142M and frozen at evaluation surpasses OpenCLIP ViT-G/14 on ImageNet-1k linear probing by roughly 0.3%, with larger gains on robust test sets like ImageNet-V2 [src_039]. The interpretation is that text supervision is one source of signal, but it is not strictly necessary; a sufficiently rich self-supervised objective on a sufficiently diverse image corpus can produce comparable features. CLIP-style training remains attractive for zero-shot classification through text, but the visual encoder itself does not need text to reach the foundation-model frontier.

💡 Key result

Self-distillation on a curated 142-million-image corpus (LVD-142M) without text labels produces frozen visual features that match or surpass OpenCLIP's caption-supervised features across retrieval, dense prediction, and classification benchmarks.

5. SigLIP

SigLIP keeps the text supervision but changes the loss [src_040]. CLIP-style image-text contrastive training applies a softmax twice per batch — image-to-text and text-to-image — over the full \(|B| \times |B|\) pairwise similarity matrix. To compute that softmax across data-parallel devices, every device must all-gather the image and text embeddings of every other device (all-gather is a distributed-training collective: every device sends its local embeddings to every other device, so all devices end up with the full batch), and the \(|B| \times |B|\) matrix must be materialised in memory on at least one of them. The numerical-stability trick of subtracting the maximum logit before exponentiating requires a further pass over the full batch [src_040]. The recipe is correct, but it is also expensive, and it couples the loss definition to the batch size: the softmax normalises over the \(|B|\) candidates in the batch, so changing the batch changes the task being optimised.
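
For contrast with the sigmoid loss defined below, here is a minimal sketch of the softmax (CLIP-style) objective. It assumes the full batch of embeddings has already been gathered onto one device, which is precisely the step SigLIP removes:

import torch
import torch.nn.functional as F

def clip_softmax_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                      t_prime: torch.Tensor) -> torch.Tensor:
    """Symmetric softmax contrastive loss over the full |B| x |B| similarity matrix."""
    z_img = F.normalize(img_emb, dim=-1)
    z_txt = F.normalize(txt_emb, dim=-1)
    logits = z_img @ z_txt.T * t_prime.exp()   # (|B|, |B|), materialised in memory
    targets = torch.arange(logits.shape[0], device=logits.device)
    # One softmax per row (image-to-text) and one per column (text-to-image).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))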

🎯 Intuition

Imagine the \(|B| \times |B|\) similarity matrix between every image and every text in the batch. CLIP normalises along each row and each column with two softmaxes, treating the matching pair as one of \(|B|\) candidates per row (and per column). SigLIP just asks each cell yes/no: is this image-text pair a match? The matching diagonal is positive, every off-diagonal cell is negative, and the total loss is a sum of independent binary classifications — one per cell of the matrix.

SigLIP replaces the global softmax with a per-pair sigmoid. Each image-text pair becomes an independent binary classification — positive for the matching pair, negative otherwise — and the loss for a single pair \((i, j)\) takes the form

\[ L_{ij} = \log \sigma\!\big(z_{ij}\,(t \cdot \mathbf{x}_i \cdot \mathbf{y}_j) + b\big), \]

where \(\mathbf{x}_i = f(I_i) / \lVert f(I_i) \rVert_2\) is the L2-normalised image embedding, \(\mathbf{y}_j = g(T_j) / \lVert g(T_j) \rVert_2\) is the L2-normalised text embedding, \(t = \exp(t_0)\) is a learnable temperature, \(b\) is a learnable bias scalar, and \(z_{ij} = +1\) when \(i = j\) and \(z_{ij} = -1\) otherwise [src_040]. The total batch loss is \(-\frac{1}{|B|}\sum_{i,j} L_{ij}\), with the sum running over all \(|B|^2\) image-text pairs in the batch.

🤔 Pause and reflect

Set \(z_{ij} = +1\) in the loss above: what does \(\log \sigma\) reward (high or low similarity)? Now set \(z_{ij} = -1\): what is rewarded (high or low similarity)? Why does this two-cases-from-one-equation construction make each cell of the similarity matrix an independent binary classifier? (Do not look ahead.)

The decoupling consequence is the load-bearing point. Because the loss is a sum of independent per-pair terms, no global all-gather is required and no \(|B| \times |B|\) matrix is ever materialised; each device computes its share, swaps embeddings with a neighbour, and accumulates [src_040]. The per-device memory cost goes from \(|B|^2\) to \(b^2\) where \(b = |B|/D\) is the per-device batch and \(D\) is the device count. SigLIP scales to a global batch of one million on relatively few accelerators, and at small batch sizes — the paper highlights below 16k as the regime where the gap is most visible — it outperforms softmax by a substantial margin [src_040]. Both losses saturate around 32k. The takeaway is not that bigger batches are unhelpful but that the "huge batch is mandatory" regime of CLIP was, in part, an artifact of the softmax recipe.

Distributed implementation

A pseudocode sketch of the per-batch loss, following the paper's algorithm 1, fits in a few lines:

import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                t_prime: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """SigLIP sigmoid loss. img_emb, txt_emb have shape (n, dim).
    t_prime and b are learnable scalars."""
    n = img_emb.shape[0]
    t = t_prime.exp()
    z_img = F.normalize(img_emb, dim=-1)
    z_txt = F.normalize(txt_emb, dim=-1)
    logits = z_img @ z_txt.T * t + b                           # (n, n) pairwise similarities
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0    # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / n            # all pairs, normalised by n

The sign matrix labels is \(+1\) on the diagonal and \(-1\) elsewhere, exactly mirroring the \(z_{ij}\) scalar in the equation above [src_040]. Note that this snippet returns the per-device contribution; in the full distributed implementation the embeddings are permuted across devices and the per-device losses are summed, but no all-gather is required.
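
A sketch of how that permute-and-accumulate loop might look, reusing siglip_loss (and its imports) from the snippet above for the local chunk. The neighbour-exchange primitive swap_with_neighbour is hypothetical, standing in for a collective permute across devices, and the sketch ignores gradient-synchronisation details:

def chunked_siglip_loss(img_emb, txt_emb, t_prime, b, swap_with_neighbour, num_devices):
    """Per-device SigLIP loss without an all-gather: positives and local negatives
    first, then (num_devices - 1) rounds of swapping text embeddings with a
    neighbour, each round contributing negatives only."""
    loss = siglip_loss(img_emb, txt_emb, t_prime, b)     # diagonal positives + local negatives
    z_img = F.normalize(img_emb, dim=-1)
    neg_txt = txt_emb
    for _ in range(num_devices - 1):
        neg_txt = swap_with_neighbour(neg_txt)           # receive the next device's text chunk
        logits = z_img @ F.normalize(neg_txt, dim=-1).T * t_prime.exp() + b
        loss = loss - F.logsigmoid(-logits).sum() / img_emb.shape[0]  # every pair is a negative
    return loss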

💡 Key result

Replacing CLIP's softmax with a per-pair sigmoid decouples the loss from the batch dimension: the same recipe scales to a global batch of one million on relatively few accelerators and outperforms softmax below 16k batch size.

🔄 Recap

  • Explain: Why is DINOv2 called "self-distillation rather than supervised distillation"? Name the EMA-as-self mechanism in one sentence.
  • Compare: DINOv2 and SigLIP are both routes around the supervised-classification bottleneck. What does each remove and what does each keep? Which is text-free, and which keeps text but cheapens the loss?
  • Predict: Given a 1B-parameter ViT-g/14 trained with DINOv2 on LVD-142M and frozen at evaluation, predict whether it beats OpenCLIP ViT-G/14 on ImageNet-1k linear probing — and roughly by how much, and on which kind of test set the gap is largest.

6. SAM-2

SAM-2 sits a layer above the previous three. It is a segmentation foundation model, not a feature extractor, but it leans on the same machinery: an MAE-pretrained ViT-style image encoder, plus task-specific heads and a memory module that turns the system from an image segmenter into a video segmenter [src_041]. Examining its architecture is instructive because it shows what happens when self-supervised features are deployed downstream rather than evaluated on linear probes.

The task SAM-2 solves is Promptable Visual Segmentation: given a video and a prompt — a point click, a bounding box, a mask, or a combination — on any frame, produce a spatio-temporal mask (a "masklet") that segments the prompted object across the entire video [src_041]. The architecture has four components [src_041]. First, an image encoder runs once per frame and emits unconditioned spatial features. SAM-2 uses a hierarchical Hiera backbone (a 2023 hierarchical ViT with a multi-stage feature pyramid and no shifted windows — a sibling to Swin in spirit, but simpler), which is itself MAE-pretrained, threading the recipe of section 2 directly into a downstream foundation model. Second, a memory attention module conditions the current frame's features on memories from past frames, by stacking transformer blocks that perform self-attention followed by cross-attention to a memory bank. Third, a prompt encoder turns clicks, boxes, or masks into prompt tokens. Fourth, a lightweight mask decoder predicts the per-frame segmentation mask.
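
A schematic of the per-frame flow these four components imply. Every attribute name (image_encoder, memory_attention, and so on) is illustrative shorthand for the component described above rather than the released API, and the memory bank is simplified to plain lists:

def segment_video(frames, prompt, model):
    """Streaming masklet propagation. The prompt arrives on the first frame here,
    but in general it can arrive on any frame and be refined with later clicks."""
    memories, object_pointers, masks = [], [], []
    for i, frame in enumerate(frames):
        feats = model.image_encoder(frame)               # unconditioned, run once per frame
        cond = model.memory_attention(feats, memories, object_pointers)
        prompt_tokens = model.prompt_encoder(prompt) if i == 0 else None
        mask, obj_ptr = model.mask_decoder(cond, prompt_tokens)
        memories.append((feats, mask))                   # spatial memory entry for this frame
        object_pointers.append(obj_ptr)                  # semantic identity of the object
        masks.append(mask)
    return masks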

The memory bank is the temporal-extension trick. It is a FIFO queue of up to \(N\) recent unprompted frames' spatial features, plus a separate FIFO queue of up to \(M\) prompted-frame memories whose temporal positions are not encoded — because at inference time prompted frames may come from a temporal range very different from training [src_041]. Alongside these spatial memories, the bank stores a list of object pointers — lightweight vectors derived from the mask decoder's output tokens that carry high-level semantic information about the segmented object. Memory attention cross-attends to both the spatial memory features and the object pointers.
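
Seen as a data structure, the memory bank is small and concrete. A sketch with illustrative field names and capacities; the actual values of \(N\) and \(M\) are configuration choices in the released model:

from collections import deque

class MemoryBank:
    """Two FIFO queues of spatial memories plus a list of object pointers."""
    def __init__(self, n_recent: int = 6, m_prompted: int = 8):
        self.recent = deque(maxlen=n_recent)      # spatial features of recent unprompted frames
        self.prompted = deque(maxlen=m_prompted)  # spatial features of prompted frames (no temporal position encoding)
        self.object_pointers = []                 # lightweight vectors from the mask decoder's output tokens

    def add(self, spatial_memory, object_pointer, prompted: bool = False) -> None:
        (self.prompted if prompted else self.recent).append(spatial_memory)
        self.object_pointers.append(object_pointer)

    def for_attention(self):
        # Memory attention cross-attends to both the spatial memories and the object pointers.
        return list(self.recent) + list(self.prompted), self.object_pointers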

⚠️ Pitfall

Each component of the memory bank carries a different kind of information. The two FIFO queues hold spatial context (the \(N\) unprompted frames carry recent appearance, the \(M\) prompted frames carry prompt-aligned appearance from arbitrary times). The object pointers carry semantic identity — the what of the segmented object. Single-click recovery from occlusion works because the object pointer survives even when the spatial features no longer match the visible frame; the spatial queues alone would lose the object across long disappearances.

The combination is what lets SAM-2 propagate masklets across long videos and recover from occlusion: when an object disappears and reappears, a single corrective click on a later frame is enough to put the model back on track, because the memory bank still carries the object's identity [src_041].

The training data is the SA-V dataset: roughly 51,000 videos with 643,000 masklets, collected through a model-in-the-loop data engine that improved annotation efficiency from 38 seconds per frame in the SAM-only phase to about 5 seconds per frame in the fully-featured SAM-2 phase [src_041]. The volume — about 53 times more masks than the previous largest video segmentation dataset — is what makes the streaming-memory architecture trainable at all.

For the purposes of this chapter, SAM-2 illustrates two things. First, the MAE-pretrained Hiera encoder is not an academic curiosity; it is the load-bearing visual backbone of a deployed segmentation system [src_041, src_038]. Second, the decoupling between unconditioned per-frame features and a separate temporal memory module is a clean architectural pattern: the image encoder, trained once with MAE, can be re-used across many tasks; the memory attention is the task-specific addition that turns it into a video segmenter.

7. Where this is heading

Modern vision-language models — Llama 3.2 Vision and Qwen2-VL are the obvious 2024–2025 reference points — attach a vision encoder of exactly the kinds discussed in this chapter to a decoder-only language model. The vision tower is typically a SigLIP-style or DINOv2-style ViT, and the projection layer that maps visual tokens into the LLM's residual stream is a thin trainable adapter rather than a separate model [src_009]. This chapter does not treat VLMs in any depth; the full treatment is the natural starting point for a future chapter on multimodal models. The pointer is worth giving here because it explains why the four pretraining recipes covered above are not academic curiosities. They are the visual backbones of every modern multimodal system, and the choice of which to use at the front of a VLM is a live engineering decision rather than a settled one.
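
To make the "thin trainable adapter" concrete, here is a generic sketch of a projection layer between a vision tower and an LLM. The dimensions and the two-layer MLP shape are illustrative; real VLMs use anything from a single linear map to a token resampler:

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps vision-tower tokens into the language model's embedding space."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # (B, N, vision_dim) -> (B, N, llm_dim); the projected tokens are then
        # interleaved with text token embeddings in the LLM's input sequence.
        return self.proj(vision_tokens)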

🔗 Connection

VLMs (vision-language models) — including Llama 3.2 Vision, Qwen2-VL, and the open-weights releases of 2024–2026 — are out of scope for this chapter. When the VLM chapter is written, this section will be reduced to a single Connection callout pointing forward to it; for now §7 is a placeholder marking the structural slot.

8. Closing summary

Four recipes, four ways of getting around the supervised-classification bottleneck, and four different downstream uses. The table below organises them by training signal, by what kind of label requirement they impose, and by what they are most often deployed for in 2024–2026.

| Method | Training signal | Label requirement | Dominant downstream use |
| --- | --- | --- | --- |
| MAE [src_038] | Pixel reconstruction (MSE on masked patches) | None | Fine-tuning baseline; pretraining for hierarchical encoders such as Hiera |
| DINOv2 [src_039] | Self-distillation across student/teacher views (DINO + iBOT, multi-crop) | None (curated image set; no captions) | Frozen-feature retrieval, depth, segmentation, fine-grained classification |
| SigLIP [src_040] | Per-pair sigmoid contrastive loss on image-text pairs | Image-text pairs (web-scale) | Multimodal alignment; vision tower for VLMs |
| SAM-2 [src_041] | Mask prediction with streaming memory; MAE-pretrained Hiera backbone | Human masks (SA-V data engine) | Promptable segmentation in images and videos |

The two label-free recipes — MAE and DINOv2 — solve different problems. MAE produces a backbone that is strongest after fine-tuning, in the sense that its encoder representations are not particularly linearly separable but become very strong once a few transformer blocks are tuned [src_038]. DINOv2 produces a backbone that is strongest frozen, because the discriminative self-distillation objective explicitly shapes the feature space [src_039]. The two weakly-labelled recipes — SigLIP and SAM-2 — solve different problems too. SigLIP aligns visual and textual representations through a loss that decouples from batch size, making large-scale image-text training cheap and small-scale image-text training viable [src_040]. SAM-2 layers a temporal memory and a promptable mask decoder on top of an MAE-pretrained backbone to deliver a deployed segmentation system [src_041].

The forward pointers are short. Scaling laws (Part: Scale) explain why each of these recipes responds well to more data and more compute. Pretraining paradigms (Part: Pretraining Paradigms) make the encoder-decoder argument that frames MAE as a vision analogue of BERT-class masked autoencoding. The VLM chapter, when written, will pick up SigLIP and DINOv2 as the visual front ends of multimodal LLMs and explain the projection adapter that bridges a ViT to a Transformer decoder. Until then, the four recipes here are the working set: pixel-space masked autoencoding, discriminative self-distillation, image-text contrastive learning with sigmoid loss, and promptable segmentation with streaming memory.

🔗 Connection

Chapter 9 (Scaling laws) explains why each of these recipes responds well to more data and more compute; Chapter 7 (encoder-decoder family) makes the encoder-decoder argument that frames MAE as a vision analogue of BERT-class masked autoencoding.

🔄 Recap

  • Complete: SAM-2's memory bank holds two FIFO queues of length _____ and _____ plus a list of _____. The image encoder is a _____ backbone, itself pretrained with _____.
  • Explain: Why is single-click recovery from occlusion possible in SAM-2 but would not be possible if the bank held only the two FIFO queues without object pointers?
  • Compare: Look at the §8 four-recipes table. Which recipes are label-free, which are weakly labelled, and which is the only one that produces a deployed system rather than a backbone or feature extractor?
  • Generate: In your own words, explain why MAE features are strongest after fine-tuning while DINOv2 features are strongest frozen. What does each objective reward in feature space?

References

  • [src_008] Antonio Torralba, Phillip Isola, and William T. Freeman. Foundations of Computer Vision. MIT Press, 2024. https://visionbook.mit.edu/
  • [src_009] Robin Courant, Maika Edberg, Nicolas Dufour, and Vicky Kalogeiton. "Transformers and Visual Transformers." Humana/Springer, 2023. https://doi.org/10.1007/978-1-0716-3195-9_6
  • [src_038] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. "Masked Autoencoders Are Scalable Vision Learners." 2022. https://arxiv.org/pdf/2111.06377
  • [src_039] Maxime Oquab, Timothée Darcet, Théo Moutakanni, et al. "DINOv2: Learning Robust Visual Features without Supervision." Transactions on Machine Learning Research, 2024 (arXiv 2023). https://arxiv.org/pdf/2304.07193
  • [src_040] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. "Sigmoid Loss for Language Image Pre-Training." 2023. https://arxiv.org/pdf/2303.15343
  • [src_041] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, et al. "SAM 2: Segment Anything in Images and Videos." 2024. https://arxiv.org/pdf/2408.00714