Appendix A — Reading Roadmap

These are minimum-viable readings to engage with the source material behind The Living Deep Learning Book. The list is curated, not exhaustive: each chapter cites a handful of papers that, taken together, give you the load-bearing arguments without drowning you in derivative work. Times are honest estimates for someone comfortable with the prerequisites assumed by each chapter — first-time readers should expect roughly twice as long, and a careful re-reading (which most of these papers reward) often takes longer still. Papers within each chapter are listed in recommended reading order, not alphabetically.

Chapter 1 — The Transformer Block Revisited

  • Vaswani et al., 2017 — Attention Is All You Need. https://arxiv.org/abs/1706.03762. Start here. The architecture has aged in places, but the notation, the attention diagram, and the pre/post-norm vocabulary are everywhere downstream. ~1 hr.
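
For orientation, here is a minimal numpy sketch of the scaled dot-product attention the paper defines: softmax(QKᵀ/√d_k)V, single head, no masking or batching. Dimensions are illustrative, not from any particular model.

    # Scaled dot-product attention as defined in Vaswani et al. (2017);
    # single head, no masking, no batching.
    import numpy as np

    def attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k) similarity logits
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
        return weights @ V                                # weighted average of values

    Q, K, V = np.random.randn(4, 64), np.random.randn(6, 64), np.random.randn(6, 64)
    print(attention(Q, K, V).shape)   # (4, 64)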

Chapter 2 — Rotary Position Encoding

  • Su et al., 2021 — RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864. The original RoPE derivation. Read §3.4 carefully — the rest is application. A toy sketch of the rotation appears after this list. ~1.5 hr.
  • (Optional) Press et al., 2022 — Train Short, Test Long: Attention with Linear Biases (ALiBi). Useful for placing RoPE inside the broader context-extension lineage; searchable on arXiv. ~45 min.
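
A toy sketch of the rotary transformation from Su et al., written against nothing but numpy. The channel pairing and hyperparameters here are illustrative; real implementations differ in layout details, but the relative-offset property is the same.

    # Toy rotary position embedding: pairs of channels are rotated by an angle
    # that grows with position, so query-key dot products depend only on the
    # relative offset between positions. Not the RoFormer reference code.
    import numpy as np

    def rope(x, pos, base=10000.0):
        """Apply a rotary embedding to x of shape (d,) at integer position pos."""
        d = x.shape[-1]
        half = d // 2
        freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
        angles = pos * freqs
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:half], x[half:]                 # split channels into rotation pairs
        return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

    q = rope(np.random.randn(64), pos=7)
    k = rope(np.random.randn(64), pos=3)
    # q @ k now depends on the relative offset 7 - 3, not the absolute positions.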

Chapter 3 — Modern Normalization and Activations

  • Zhang & Sennrich, 2019 — Root Mean Square Layer Normalization. https://arxiv.org/abs/1910.07467. Short and direct; about four pages of useful content. ~30 min.
  • Shazeer, 2020 — GLU Variants Improve Transformer. https://arxiv.org/abs/2002.05202. Read just the variant table and the ablation; the rest is bookkeeping. A sketch of RMSNorm and SwiGLU follows. ~20 min.
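
A minimal sketch of both objects, assuming nothing beyond numpy. The dimensions are illustrative; the (8/3)·d hidden width mirrors Shazeer's convention of shrinking the gated MLP so its parameter count matches a standard 4·d ReLU MLP.

    # RMSNorm (Zhang & Sennrich, 2019) and a SwiGLU feed-forward block
    # (one of the GLU variants in Shazeer, 2020). Shapes are illustrative.
    import numpy as np

    def rms_norm(x, gain, eps=1e-6):
        # No mean subtraction and no bias: rescale by the root-mean-square only.
        rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
        return gain * x / rms

    def swiglu_ffn(x, w_gate, w_up, w_down):
        # SwiGLU: a SiLU-gated linear unit in place of the usual ReLU MLP.
        silu = lambda z: z / (1.0 + np.exp(-z))           # SiLU / swish
        return (silu(x @ w_gate) * (x @ w_up)) @ w_down   # elementwise gate, then project back

    d, d_ff = 512, 1365   # d_ff ~ (8/3) * d keeps parameters comparable to a 4*d ReLU MLP
    x = np.random.randn(d)
    w_gate, w_up, w_down = np.random.randn(d, d_ff), np.random.randn(d, d_ff), np.random.randn(d_ff, d)
    print(swiglu_ffn(rms_norm(x, gain=np.ones(d)), w_gate, w_up, w_down).shape)   # (512,)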

Chapter 4 — Efficient Attention at Scale

  • Ainslie et al., 2023 — GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. https://arxiv.org/abs/2305.13245. ~30 min.
  • Dao et al., 2022 — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. https://arxiv.org/abs/2205.14135. Read §3.1 carefully (tiling and recomputation); the rest can be skimmed unless you care about the kernel details. A toy sketch of the streaming softmax behind the tiling appears after this list. ~1.5 hr.
  • Dao, 2023 — FlashAttention-2. https://arxiv.org/abs/2307.08691. Engineering deltas over v1: better warp partitioning, fewer non-matmul FLOPs. ~45 min.
  • Shah et al., 2024 — FlashAttention-3. https://arxiv.org/abs/2407.08608. Hopper-specific (asynchrony, low-precision). Skip if you are not on H100-class hardware. ~30 min.
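
Because §3.1 of the FlashAttention paper is the part worth reading slowly, here is a toy numpy sketch of the streaming ("online") softmax that makes the tiling work. It reproduces exact attention for one query, block by block, and makes no attempt to model the SRAM/HBM traffic that is the actual point of the paper.

    # Streaming softmax: process keys/values one tile at a time, carrying a
    # running max and normalizer so the full (seq x seq) score matrix is never
    # materialized. A sketch of the idea, not the CUDA kernel.
    import numpy as np

    def attention_tiled(q, K, V, block=128):
        """q: (d,), K/V: (n, d). Returns softmax(q K^T / sqrt(d)) @ V, tile by tile."""
        d = q.shape[0]
        m = -np.inf                        # running max of scores seen so far
        l = 0.0                            # running softmax normalizer
        acc = np.zeros_like(V[0])          # running unnormalized weighted sum of values
        for start in range(0, len(K), block):
            s = K[start:start + block] @ q / np.sqrt(d)   # scores for this tile
            m_new = max(m, s.max())
            scale = np.exp(m - m_new)                     # rescale earlier partial results
            p = np.exp(s - m_new)
            l = l * scale + p.sum()
            acc = acc * scale + p @ V[start:start + block]
            m = m_new
        return acc / l

    q, K, V = np.random.randn(64), np.random.randn(1000, 64), np.random.randn(1000, 64)
    s = K @ q / 8.0
    w = np.exp(s - s.max()); exact = (w / w.sum()) @ V
    print(np.allclose(attention_tiled(q, K, V), exact))   # True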

Chapter 5 — Vision Transformers

  • Dosovitskiy et al., 2020 — An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale (ViT). https://arxiv.org/abs/2010.11929. The patch-embedding paper. A sketch of the patch embedding appears after this list. ~1 hr.
  • (Optional) Liu et al., 2022 — Swin Transformer V2. https://arxiv.org/abs/2111.09883. Read this when you want to understand why convolutional priors keep coming back. ~1 hr.
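
A toy sketch of the ViT patch-embedding step: cut the image into non-overlapping P×P patches, flatten each, and project it linearly into a token. Patch size, image size, and the projection width are illustrative.

    # Patchify an image into flattened patches, then project to token embeddings.
    import numpy as np

    def patchify(img, patch=16):
        """img: (H, W, C) -> (num_patches, patch*patch*C) flattened patches."""
        H, W, C = img.shape
        patches = img.reshape(H // patch, patch, W // patch, patch, C)
        patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
        return patches

    img = np.random.randn(224, 224, 3)
    tokens = patchify(img) @ np.random.randn(16 * 16 * 3, 768)   # linear projection to d_model
    print(tokens.shape)   # (196, 768): a 14 x 14 grid of patch tokens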

Chapter 6 — Self-Supervised Vision

  • He et al., 2022 — Masked Autoencoders Are Scalable Vision Learners (MAE). https://arxiv.org/abs/2111.06377. Pixel-space masked modeling; the asymmetric encoder/decoder is the key design choice. ~1 hr.
  • Oquab et al., 2023 — DINOv2: Learning Robust Visual Features without Supervision. https://arxiv.org/abs/2304.07193. Self-distillation features at scale. Read §3 closely; the rest can be skimmed. ~1.5 hr.
  • Zhai et al., 2023 — Sigmoid Loss for Language Image Pre-Training (SigLIP). https://arxiv.org/abs/2303.15343. The sigmoid contrastive loss in roughly half the length you expect. A sketch of the loss appears after this list. ~30 min.
  • Ravi et al., 2024 — SAM 2: Segment Anything in Images and Videos. https://arxiv.org/abs/2408.00714. Segmentation foundation model with memory across frames. ~1 hr.
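
A minimal sketch of the SigLIP pairwise sigmoid loss, assuming pre-normalized embeddings. Every image-text pair gets an independent binary label instead of a batch-wide softmax; the temperature and bias values mirror the paper's initialization but are otherwise placeholders.

    # Pairwise sigmoid contrastive loss: +1 labels on the diagonal (matched
    # pairs), -1 everywhere else, averaged over all pairs in the batch.
    import numpy as np

    def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
        # img_emb, txt_emb: (n, d), assumed already L2-normalized.
        logits = t * img_emb @ txt_emb.T + b            # (n, n) pairwise similarities
        labels = 2 * np.eye(len(img_emb)) - 1           # +1 on the diagonal, -1 elsewhere
        return np.mean(np.log1p(np.exp(-labels * logits)))   # negative log-sigmoid per pair

    n, d = 4, 32
    img = np.random.randn(n, d); img /= np.linalg.norm(img, axis=1, keepdims=True)
    txt = np.random.randn(n, d); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    print(siglip_loss(img, txt))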

Chapter 7 — Encoder, Decoder, and Encoder-Decoder

  • Devlin et al., 2018 — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805. Historical, but essential vocabulary. Read §3 only. ~45 min.
  • Liu et al., 2019 — RoBERTa: A Robustly Optimized BERT Pretraining Approach. https://arxiv.org/abs/1907.11692. Half the paper is what BERT got wrong about training. ~30 min.
  • Warner et al., 2024 — ModernBERT. https://arxiv.org/abs/2412.13663. What an encoder looks like in 2024 — RoPE, FlashAttention, longer context. ~1 hr.

Chapter 8 — Inside a Modern Decoder-Only LLM

  • Brown et al., 2020 — Language Models are Few-Shot Learners (GPT-3). https://arxiv.org/abs/2005.14165. The in-context learning paper; mostly historical now, but the framing still shapes how the field thinks. ~1.5 hr.
  • Touvron et al., 2023 — Llama 2: Open Foundation and Fine-Tuned Chat Models. https://arxiv.org/abs/2307.09288. The first openly released frontier-quality recipe; the data and safety appendices are worth reading. ~1 hr.
  • Meta, 2024 — The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783. Over-training, data curation, scaling decisions. Long but the architecture sections reward careful reading. ~3 hr.
  • DeepSeek-AI, 2024 — DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437. The current frontier MoE recipe — auxiliary-loss-free balancing, MLA, FP8 training. ~3 hr.

Chapter 9 — Scaling Laws

  • Kaplan et al., 2020 — Scaling Laws for Neural Language Models. https://arxiv.org/abs/2001.08361. Read with skepticism; the conclusions were partially wrong but the methodology is canonical. ~1 hr.
  • Hoffmann et al., 2022 — Training Compute-Optimal Large Language Models (Chinchilla). https://arxiv.org/abs/2203.15556. The rebuttal. Pay attention to the IsoFLOP plots and where the loss-fitting machinery is doing the heavy lifting. The rule-of-thumb arithmetic it implies is sketched after this list. ~1.5 hr.
  • Meta, 2024 — Llama 3 Herd, §3 (cited above). The post-Chinchilla regime — over-training for inference economics — is where the field actually lives now.
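
A back-of-envelope sketch of the Chinchilla allocation, using the common C ≈ 6·N·D approximation for training FLOPs and the roughly-20-tokens-per-parameter rule of thumb rather than the paper's fitted coefficients.

    # Compute-optimal allocation: with C = 6*N*D and D = 20*N, the optimal
    # parameter count scales as sqrt(C). Constants are rules of thumb.
    def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
        n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
        n_tokens = tokens_per_param * n_params
        return n_params, n_tokens

    n, d = chinchilla_optimal(5.76e23)   # roughly the Chinchilla training budget
    print(f"~{n/1e9:.0f}B params, ~{d/1e12:.1f}T tokens")   # ~69B params, ~1.4T tokens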

Chapter 10 — Mixture of Experts

  • DeepSeek-V3 (DeepSeek-AI, 2024; cited under Chapter 8). For this chapter the MoE-specific material is the relevant reading, in particular the routing and the auxiliary-loss-free load balancing.

Chapter 11 — From SFT to RLHF

  • Ouyang et al., 2022 — Training Language Models to Follow Instructions with Human Feedback (InstructGPT). https://arxiv.org/abs/2203.02155. The canonical RLHF paper. Read the data-collection appendix as carefully as the algorithm. ~1.5 hr.

Chapter 12 — Direct Preference Optimization

  • Rafailov et al., 2023 — Direct Preference Optimization: Your Language Model is Secretly a Reward Model. https://arxiv.org/abs/2305.18290. The §4 derivation is the entire point of the paper; everything else follows. ~1.5 hr.
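
A minimal sketch of the resulting loss, with placeholder log-probabilities standing in for a policy and a frozen reference model: the policy's log-ratio against the reference acts as an implicit reward, and preferred versus rejected completions are compared with a logistic loss.

    # DPO loss for one preference pair. The log probabilities would come from
    # two forward passes (policy and frozen reference) over each completion.
    import math

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        chosen_margin = logp_chosen - ref_logp_chosen        # implicit reward of preferred answer
        rejected_margin = logp_rejected - ref_logp_rejected  # implicit reward of rejected answer
        logits = beta * (chosen_margin - rejected_margin)
        return -math.log(1.0 / (1.0 + math.exp(-logits)))    # negative log-sigmoid

    # Example: the policy prefers the chosen answer a bit more than the reference does.
    print(dpo_loss(-12.0, -20.0, -13.0, -18.0))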

Chapter 13 — Reasoning Models and Verifiable Rewards

  • Wei et al., 2022 — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. https://arxiv.org/abs/2201.11903. Historical, but it set the vocabulary the field still uses. ~30 min.
  • Shao et al., 2024 — DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. https://arxiv.org/abs/2402.03300. The GRPO algorithm, in the paper that introduced it. A sketch of its group-relative advantage appears after this list. ~1.5 hr.
  • DeepSeek-AI, 2025 — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/abs/2501.12948. RL-trained reasoning at frontier scale, with an unusually candid failure-mode discussion. ~2 hr.
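
A toy sketch of the group-relative advantage that gives GRPO its name: sample several completions per prompt, score them with a verifiable reward, and normalize each reward against its own group. The rewards below are placeholder verifier scores, and the surrounding clipped policy-gradient update is omitted.

    # Group-relative advantages: each sample's reward is standardized against
    # the other samples drawn for the same prompt.
    import numpy as np

    def group_relative_advantages(rewards, eps=1e-8):
        """rewards: (n_prompts, group_size) verifier scores, e.g. 1.0 if the answer checks out."""
        mean = rewards.mean(axis=1, keepdims=True)
        std = rewards.std(axis=1, keepdims=True)
        return (rewards - mean) / (std + eps)     # advantage of each sample within its group

    rewards = np.array([[1.0, 0.0, 0.0, 1.0],     # two of four samples solved prompt 1
                        [0.0, 0.0, 0.0, 1.0]])    # one of four solved prompt 2
    print(group_relative_advantages(rewards))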

Total time and a weekend subset

The full primary list above is roughly 30–40 hours of honest reading time, more if you stop to derive things on paper or run code. If you only have a weekend and want the spine of the book in five papers, read these in order:

  1. Vaswani et al., 2017 (Attention Is All You Need).
  2. Su et al., 2021 (RoPE).
  3. Dao et al., 2022 (FlashAttention).
  4. Hoffmann et al., 2022 (Chinchilla).
  5. Rafailov et al., 2023 (DPO).

That subset gets you the architecture, the position-encoding shift, the attention-kernel revolution, the compute-allocation argument, and the alignment turn — enough to read a 2025-vintage technical report without backfilling.