Appendix B – Implementation References
A curated map of clean reference implementations matching the topics covered in The Living Deep Learning Book. The intent is to give one or two trustworthy starting points per area – not an exhaustive catalogue. Entries are picked for code clarity, active maintenance, and pedagogical value rather than benchmark numbers. For any given paper, the official author-released code is usually the right place to begin; for any given architecture in production use, huggingface/transformers is usually the cleanest single-file reference.
End-to-end pretraining (small scale)
Connection
These reference repositories support Chapter 7, Encoder, Decoder, and Encoder-Decoder, Chapter 8, Inside a Modern Decoder-Only LLM, and Chapter 9, Scaling Laws.
- karpathy/nanoGPT – https://github.com/karpathy/nanoGPT. A clean GPT-2 from scratch, with the model in roughly 300 lines of Python. The best place to read a training loop end-to-end without abstraction (a minimal sketch of such a loop follows this list).
- karpathy/build-nanogpt – https://github.com/karpathy/build-nanogpt. Companion repository to Karpathy's full-length YouTube reproduction; commits map to lecture stages, which is useful when you want to step through the construction.
- rasbt/LLMs-from-scratch – https://github.com/rasbt/LLMs-from-scratch. Sebastian Raschka's book repository; Jupyter notebooks built up incrementally. Slower-paced and more pedagogical than nanoGPT.
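For orientation, the sketch below condenses what a decoder-only pretraining loop like nanoGPT's boils down to. It is an illustration rather than code from any of the repositories above; `model` and `get_batch` are placeholder names for a GPT-style module returning logits and a data sampler returning token/target pairs.

```python
# Minimal sketch of a decoder-only pretraining loop (illustrative, not nanoGPT's code).
# `model` and `get_batch` are hypothetical placeholders.
import torch
import torch.nn.functional as F

def train(model, get_batch, steps=1000, lr=3e-4, device="cuda"):
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        x, y = get_batch()                       # token ids (batch, seq) and next-token targets
        x, y = x.to(device), y.to(device)
        logits = model(x)                        # (batch, seq, vocab)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # standard stability measure
        optimizer.step()
        if step % 100 == 0:
            print(f"step {step}: loss {loss.item():.3f}")
```

Real pretraining code adds mixed precision, gradient accumulation, a learning-rate schedule, and checkpointing around exactly this core.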
Production-quality references
Connection
Production-grade references for the model families discussed in Chapter 8, Inside a Modern Decoder-Only LLM (Llama, Mistral, Mixtral, Qwen, Gemma, ModernBERT), with broader coverage of the pretraining workflow from Chapter 7, Encoder, Decoder, and Encoder-Decoder, and Chapter 9, Scaling Laws.
- meta-llama/llama – https://github.com/meta-llama/llama. Official Llama 2/3 inference reference. Short, unobfuscated PyTorch – useful as the canonical example of a modern decoder-only block.
- meta-llama/llama-recipes – https://github.com/meta-llama/llama-recipes. Official fine-tuning, evaluation, and deployment recipes; treat as a pattern library.
- huggingface/transformers – https://github.com/huggingface/transformers. For any specific model, look at src/transformers/models/<arch>/modeling_*.py. The cleanest single-file reference for ViT, BERT, Llama, Mistral, Mixtral, Qwen, Gemma, ModernBERT, and most of what this book discusses (a minimal loading sketch follows this list).
- huggingface/nanotron – https://github.com/huggingface/nanotron. Production-grade pretraining at scale; the companion code to the Hugging Face Ultra-Scale Playbook. Read when nanoGPT is no longer enough.
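As a quick illustration of the transformers entry point, the sketch below loads a decoder-only checkpoint and runs generation. The model id is illustrative (gated checkpoints require authentication), and device_map="auto" assumes the accelerate package is installed.

```python
# Minimal sketch: loading a decoder-only model through huggingface/transformers.
# The checkpoint id is illustrative; substitute any model you have access to.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # illustrative; gated models need authentication
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Scaling laws tell us that", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# The layer definitions live under src/transformers/models/<arch>/modeling_*.py;
# printing the module name shows which file defines the loaded architecture.
print(type(model).__module__)
```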
Inference and serving
Connection
These engines implement the inference-time techniques introduced in Chapter 4, Efficient Attention at Scale (KV-cache, PagedAttention, speculative decoding, continuous batching).
- pytorch-labs/gpt-fast – https://github.com/pytorch-labs/gpt-fast. Reference for torch.compile plus speculative decoding in a few hundred lines. Excellent for understanding what the compiler actually does to a transformer forward pass.
- vllm-project/vllm – https://github.com/vllm-project/vllm. Production inference engine; the canonical reference implementation of PagedAttention and continuous batching (a minimal usage sketch follows this list).
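The sketch below shows vLLM's offline batched-generation API, the surface under which PagedAttention and continuous batching operate. The checkpoint id is illustrative.

```python
# Minimal sketch of offline batched generation with vLLM; PagedAttention and
# continuous batching happen inside the engine, not in user code.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative checkpoint
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Explain the KV-cache in one sentence.",
    "What does continuous batching buy you?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```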
FlashAttention and kernels
Connection
FlashAttention is introduced in Chapter 4, Efficient Attention at Scale; these kernels are the production realisation of that chapter's IO-aware attention design.
- Dao-AILab/flash-attention – https://github.com/Dao-AILab/flash-attention. The canonical implementation of FlashAttention v1, v2, and v3. The Python wrappers are readable; the CUDA kernels are not, but you do not need them to understand the design (a calling sketch follows this list).
- state-spaces/mamba – https://github.com/state-spaces/mamba. Tri Dao's state-space sequence model. Useful context if your reading drifts beyond attention into the alternatives that almost displaced it.
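A calling sketch for the flash-attn Python wrapper, following the (batch, seqlen, nheads, headdim) layout its README documents. The tensor sizes are illustrative, and a CUDA device with fp16/bf16 inputs is assumed.

```python
# Minimal sketch: calling the FlashAttention kernel through its Python wrapper.
# Inputs must be fp16/bf16 on a CUDA device; sizes below are illustrative.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, nheads, headdim)
print(out.shape)

# Without installing flash-attn, torch.nn.functional.scaled_dot_product_attention
# can dispatch to a FlashAttention backend; note it expects (batch, nheads, seqlen, headdim).
```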
Mixture of experts (MoE)
Connection
These references implement the routing and balancing schemes covered in Chapter 10, Mixture of Experts; a toy top-2 routing sketch follows the list.
- mistralai/mistral-inference – https://github.com/mistralai/mistral-inference. Mixtral routing reference. Compact and easy to read alongside the paper.
- deepseek-ai/DeepSeek-V3 – https://github.com/deepseek-ai/DeepSeek-V3. Auxiliary-loss-free expert balancing in production code; the routing scheme described in the technical report.
- microsoft/tutel – https://github.com/microsoft/tutel. Research-grade MoE primitives; useful when you want to swap routing or balancing schemes without rewriting the whole stack.
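To make the routing idea concrete, here is a toy top-2 gate in plain PyTorch. It is an illustration of the Mixtral-style scheme, not code from the repositories above, and it omits the capacity limits and load-balancing losses that production implementations add.

```python
# Toy sketch of top-2 token routing (Mixtral-style gating); illustrative only.
import torch
import torch.nn.functional as F

def top2_route(x, gate_weight, experts):
    """x: (tokens, d_model); gate_weight: (n_experts, d_model); experts: list of modules."""
    logits = x @ gate_weight.t()                     # (tokens, n_experts) routing scores
    topk_logits, topk_idx = logits.topk(2, dim=-1)   # pick two experts per token
    weights = F.softmax(topk_logits, dim=-1)         # renormalise over the chosen two
    out = torch.zeros_like(x)
    for slot in range(2):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e            # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
    return out
```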
Alignment, RLHF, DPO, and GRPO
Connection
These trainers cover the alignment progression of Chapter 11, From SFT to RLHF, Chapter 12, Direct Preference Optimization, and Chapter 13, Reasoning Models and Verifiable Rewards.
- huggingface/trl – https://github.com/huggingface/trl. DPO, PPO, KTO, ORPO, GRPO trainers in one place. The most pragmatic starting point for alignment experiments – code is opinionated but consistent across methods (the core DPO objective is sketched after this list).
- allenai/open-instruct – https://github.com/allenai/open-instruct. Full reproducible alignment pipelines from Allen AI, including data preparation and evaluation. Useful when you want a complete recipe rather than just trainer code.
- volcengine/verl – https://github.com/volcengine/verl. GRPO and verifiable-reward training at scale; a current production-grade reference for the post-DeepSeek-R1 reasoning pipeline.
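As a reference point for what these trainers optimise, here is the DPO objective written out directly. This is a sketch from the paper's formulation rather than trl's internal code, and it assumes per-sequence log-probabilities have already been summed over tokens.

```python
# Sketch of the core DPO loss; illustrative, not trl's implementation.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Each argument: (batch,) summed log-probs of the chosen/rejected responses."""
    chosen_rewards = policy_chosen_logp - ref_chosen_logp        # implicit reward, chosen
    rejected_rewards = policy_rejected_logp - ref_rejected_logp  # implicit reward, rejected
    # Maximise the chosen-vs-rejected margin under a Bradley-Terry preference model.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```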
Vision
Connection
Reference implementations for the vision architectures of Chapter 5, Vision Transformers (ViT, Swin) and the self-supervised approaches of Chapter 6, Self-Supervised Vision (DINOv2, MAE, SAM-2).
- facebookresearch/dinov2 – https://github.com/facebookresearch/dinov2. Official DINOv2 code, including the self-distillation training loop and feature-extraction utilities (a torch.hub loading sketch follows this list).
- facebookresearch/mae – https://github.com/facebookresearch/mae. Official MAE; small repository, easy to read top-to-bottom.
- facebookresearch/sam2 – https://github.com/facebookresearch/sam2. Official SAM 2 with the memory module for video.
- microsoft/Swin-Transformer – https://github.com/microsoft/Swin-Transformer. Official Swin v1 and v2; the cleanest single source for the shifted-window attention pattern.
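A loading sketch for DINOv2 features via torch.hub, the route the repository's README documents; the preprocessing pipeline and image path below are illustrative.

```python
# Minimal sketch: pulling DINOv2 features through torch.hub.
# Preprocessing is a generic ImageNet-style pipeline; the image path is illustrative.
import torch
from torchvision import transforms
from PIL import Image

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),          # 224 is a multiple of the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg")).unsqueeze(0)
with torch.no_grad():
    features = model(img)                # global embedding (384-dim for ViT-S/14)
print(features.shape)
```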
Caveat on freshness
Repository quality drifts over time: maintainers move on, branches diverge from papers, and yesterday's reference implementation can become today's deprecated artifact. The list above was current as of April 2026. Before committing to any of these as a study target – and especially before forking one for a project – check the last-commit date, open-issue count, and whether the README still matches the codebase. The author of this book welcomes issues and pull requests adding, removing, or correcting entries.