Mixture of Experts
1. The premise: why MoE matters in 2026
A dense Transformer applies every parameter to every token. That symmetry is mathematically clean and was the right default while the field was still learning what scale even meant. By 2024 it had become the wrong default at the open frontier. Mixtral 8x7B, DBRX, Grok-1, DeepSeek-V2, Qwen1.5-MoE, Hunyuan-Large, and DeepSeek-V3 all shipped through 2024 as open-weights Mixture-of-Experts (MoE) language models, and the MoE survey of Cai et al. catalogues the cluster as a single architectural wave rather than a string of isolated experiments [src_026]. Qwen3-235B-A22B carries the same logic into 2025.
The reason is a single mechanical observation. A dense FFN block applied to a token of width \(D\) costs roughly \(8D^2\) multiply-accumulates on that token (two matrices of shape \(D \times 4D\) and \(4D \times D\) under the conventional \(d_{\mathrm{ff}} = 4D\)), regardless of which token it is. An MoE FFN block keeps a much larger pool of expert FFNs in memory but only routes each token through a small subset, so the total parameter count grows with the size of the pool while the per-token compute stays bounded by how many experts are activated. The headline numbers make the trade-off concrete: DeepSeek-V3 has 671B total parameters with 37B activated for each token [src_031]. Mixtral 8x7B has 47B total with 13B active per token, roughly 5x fewer active parameters than the dense Llama-2 70B baseline that it nonetheless matches or exceeds across most benchmarks [src_025].
🔗 Connection
The over-training intuition that drove Llama-3 to push past the Chinchilla compute-optimum is the same intuition that drives MoE adoption; see Chapter 9 (Scaling Laws). The active-vs-total parameter decoupling that §8 of this chapter formalises is the architectural counterpart to over-training's data-vs-parameter decoupling.
This chapter walks through the mechanics that make that trade-off work, the failure mode (load imbalance) that almost killed the architecture in production, the two-generation cure for that failure (auxiliary loss in 2021, auxiliary-loss-free balancing in 2024), the systems cost that the architecture exchanges for its parameter efficiency (all-to-all communication during expert parallelism), and the empirical state of the question circa 2025: MoE is the open-frontier default, but dense is still alive at the very top end (Llama-3 405B), and the choice depends on what the team is optimising for [src_007, src_026, src_031].
2. Anatomy of an MoE FFN block
The simplest way to think of an MoE block is as a drop-in replacement for the dense FFN sub-block of a standard Transformer block. Recall that an FFN sub-block in a Transformer is two linear layers separated by a non-linearity, applied independently per token. Where a dense Transformer applies a single feed-forward network \(\mathrm{FFN}\) to the residual-stream activation \(x\), an MoE block keeps a pool of \(E\) expert FFNs \(\{E_1, E_2, \ldots, E_E\}\) together with a small linear router. For each token, the router produces a score vector, the top-\(k\) entries of that vector are kept, and the corresponding experts are evaluated. The remaining \(E - k\) experts contribute exactly zero on that token; their weights are touched only on the tokens routed to them, not on this one.
Following Mixtral's convention, write the router as a linear projection \(W_g \in \mathbb{R}^{E \times D}\) followed by a softmax over the top-\(k\) logits.
🎯 Intuition
Picture the router as a learned dispatcher. For each token, it glances at the activation, looks across all \(E\) experts, writes down the names of the top-\(k\) that look most promising, and the block runs only those experts on this token. The block's output is a weighted sum of those \(k\) expert outputs, with weights coming from the dispatcher's confidence scores.
The gate vector is

\[
G(x) = \mathrm{Softmax}\big(\mathrm{TopK}(W_g \cdot x)\big),
\]

where \(\mathrm{TopK}(\ell)_i = \ell_i\) if \(\ell_i\) is among the top-\(k\) coordinates of the logit vector \(\ell\), and \(-\infty\) otherwise [src_025]. The block's output for input token \(x\) is then

\[
y = \sum_{i=1}^{E} g_i \cdot E_i(x),
\]

with \(g_i = G(x)_i\) the renormalised gate score for expert \(i\) (zero if expert \(i\) was not selected). Switch Transformer's earlier notation writes \(p_i(x)\) for the same gate value in equation (2) of Fedus et al., and the mathematics is identical [src_024]. We use \(g_i\) throughout to stay aligned with the Mixtral and DeepSeek-V3 papers.
Two consequences fall out of this construction immediately. First, because only \(k\) of \(E\) experts run per token, the FLOP cost of the block scales with \(k\) rather than with \(E\). Doubling \(E\) doubles the model's parameter count but does not change per-token compute. Second, because the gating coefficients \(g_i\) are produced by a softmax over the selected experts, the router stays differentiable, and gradients flow from the loss back through the gate values and into \(W_g\). Switch Transformer made the additional point that this remains true even at \(k = 1\), contradicting an earlier conjecture that \(k > 1\) was needed for non-trivial routing gradients [src_024].
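To make the routing arithmetic concrete, the sketch below instantiates the gate and output equations in PyTorch for a toy block. The class name, dimensions, and GELU experts are illustrative choices of ours, not taken from any cited implementation (Mixtral's experts are SwiGLU FFNs, and production kernels group tokens by expert rather than looping over them).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Illustrative top-k MoE FFN block: a linear router plus E small expert FFNs."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)   # W_g
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (tokens, d_model)
        logits = self.router(x)                                      # (tokens, E)
        top_vals, top_idx = logits.topk(self.k, dim=-1)              # keep the top-k logits per token
        gates = F.softmax(top_vals, dim=-1)                          # renormalise over the selected k only
        y = torch.zeros_like(x)
        for t in range(x.shape[0]):                                  # explicit loops for clarity only
            for slot in range(self.k):
                expert = self.experts[top_idx[t, slot].item()]
                y[t] += gates[t, slot] * expert(x[t])                # weighted sum of selected experts
        return y

moe = ToyMoE(d_model=64, d_ff=256, num_experts=8, k=2)
print(moe(torch.randn(4, 64)).shape)   # torch.Size([4, 64])
```

Doubling `num_experts` in this sketch doubles the parameter count of `self.experts` but leaves the per-token work at \(k\) expert evaluations plus one router mat-vec, which is exactly the decoupling the rest of the chapter trades on.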
In transformer-based MoE LLMs, the MoE block typically replaces the FFN sub-block of every Transformer layer (Mixtral) or every other layer (the GShard tradition) [src_025]. DeepSeek-V3 substitutes the FFN of every layer except the first three with an MoE block, and additionally splits the experts into a small set of always-on shared experts plus a much larger pool of routed experts [src_031]. The architectural decision of how many layers carry MoE and whether to keep shared experts is orthogonal to the routing equation above; both architectures still compute the same weighted sum of selected expert outputs.
3. Routing variants: top-1, top-2, top-k
Three routing regimes have all been deployed at scale. They are not exotic variants of one another; they correspond to three distinct points on a quality / compute / load-balance trade-off.
Top-1 (Switch Transformer, 2021). Each token is routed to exactly one expert. Switch's claim is that top-1 routing preserves quality, halves the per-token expert workload relative to top-2, and simplifies cross-device communication; the model becomes FLOP-matched to a dense baseline at the FFN sub-block while parameter count grows freely with \(E\) [src_024]. The benefits Switch enumerates are concrete: reduced router computation, expert capacity halved at fixed batch size, and a simpler implementation [src_024]. Switch's experiments use 16, 32, 64, 128, and 256 experts per layer.
Top-2 (Mixtral 8x7B, 2024). Each token activates two of eight experts; the experts are SwiGLU FFNs whose outputs are combined by their renormalised gate values [src_025]. Top-2 buys back the ability to mix two specialists per token at the cost of doubling the per-token FFN compute relative to top-1. Mixtral's results suggest the trade is favourable at this scale: 13B active parameters match or exceed Llama-2 70B across most benchmarks the authors evaluate [src_025]. Mixtral's eight-expert structure is also small enough to fit on a single multi-GPU server, which made the model unusually friendly to early open-source serving stacks.
🤔 Pause and reflect
Hold \(E\) fixed at 64. Before reading on, predict: what changes in per-token FFN compute and in the size of the combination space when \(k\) moves from 2 to 8? Which of those two quantities does the field appear to be willing to spend on? (Don't look ahead; answer first.)
Top-k for \(k > 2\) (DeepSeek-V3, 2024). DeepSeek-V3 keeps 256 fine-grained routed experts per layer plus 1 always-on shared expert, and activates 8 of the 256 routed experts for each token in addition to the shared one [src_031]. The fine-grained design is a separate idea from the routing \(k\): holding total expert capacity fixed but breaking it into more, smaller experts gives the router more degrees of freedom in combining specialists. With \(k = 8\) the per-token compute is higher than Mixtral's \(k = 2\), but each individual expert is correspondingly smaller, and the combination space (\(\binom{256}{8}\)) is large enough that the router can compose rather than merely choose. Qwen3-235B-A22B follows the same general pattern with 128 experts and top-8 routing.
The lever the field has been pulling hardest is increasing \(E\) at fixed \(k\) (or at slowly-growing \(k\)), because that is the direction that grows total parameters without growing per-token compute [src_025, src_026]. Increase \(E\) from 8 to 256 with \(k = 2\) and you have multiplied the model's total capacity by 32x while the number of experts each token visits is unchanged. The compute on that token is not unchanged, because the router itself is now scoring 256 candidates instead of 8, but the router's cost is dominated by a single \(D \times E\) linear layer and stays small relative to the FFN cost of the selected experts.
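The arithmetic behind both claims, the size of the combination space and the smallness of the router's cost, fits in a few lines. The dimensions below are illustrative round numbers of our own choosing, not any paper's exact configuration.

```python
import math

# Combination space: how many distinct expert subsets a single token can be routed to.
print(math.comb(8, 2))     # Mixtral-style 8 experts, top-2        -> 28
print(math.comb(256, 8))   # DeepSeek-V3-style 256 experts, top-8  -> ~4.1e14

# Rough per-token cost split for one MoE layer, counted in multiply-accumulates.
D, d_ff, E, k = 4096, 14336, 256, 8        # toy numbers, not a real configuration
router_cost = D * E                         # one D x E linear scoring every expert
expert_cost = k * (3 * D * d_ff)            # k SwiGLU-style FFNs, ~3 D*d_ff mat-vecs each
print(f"router / expert cost ratio: {router_cost / expert_cost:.5f}")   # well under 1%
```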
4. The load-balancing problem
Without intervention, MoE routing collapses. A small fraction of experts attract most of the tokens, the remaining experts go cold, and the model effectively becomes a much smaller dense model with most of its parameters serving as expensive paperweights. Switch Transformer, the MoE survey, and DeepSeek-V3 all describe this failure mode in similar terms [src_024, src_026, src_031]. The cause is a positive feedback loop: experts that get more tokens get more gradient signal, get better at processing the kinds of tokens they already see, and thus get scored higher by the router on those tokens; the gate's softmax sharpens on the same handful of indices. (This loop is closed by the same router-gradient pathway introduced in §2: gradients flow from the loss back through \(g_i\) and into \(W_g\), so a router that routes well on the tokens it sees gets reinforced for that routing on similar tokens.)
Switch Transformer's response is the canonical auxiliary-loss approach. Adopting Switch's notation (the auxiliary loss was authored with \(N\) for expert count, so we write \(N\) in this section as an alias for our \(E\)), and given a batch \(B\) of \(T\) tokens, Switch defines

\[
f_i = \frac{1}{T} \sum_{x \in B} \mathbb{1}\{\arg\max p(x) = i\}
\]

as the fraction of tokens routed to expert \(i\), and

\[
P_i = \frac{1}{T} \sum_{x \in B} p_i(x)
\]

as the average router probability mass on expert \(i\) across the batch [src_024]. Both vectors live on the simplex over experts; both have value \(1/N\) under perfectly uniform routing. The auxiliary loss is the scaled dot product of the two:

\[
\mathcal{L}_{\mathrm{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i .
\]
🎯 Intuition
Why a dot product detects imbalance: if both \(f\) (where tokens actually went) and \(P\) (where the router wanted them to go) pile mass on the same expert, the dot product spikes. Distributions that are uniform over experts minimise \(\sum_i f_i P_i\) subject to the simplex constraint. So the loss is large exactly when the router is committing to the same handful of experts that are receiving the bulk of the tokens, which is the collapse mode this section opened with.
This loss is added to the cross-entropy objective during training [src_024].
Three properties of the auxiliary loss
Three properties of the formula are worth pausing on. First, the loss is minimised when both \(f\) and \(P\) equal \(1/N\) uniformly, so the gradient pushes the system toward balanced routing. Second, the multiplicative \(N\) on the right-hand side keeps the loss's value comparable across choices of \(N\): under uniform routing the bare sum is \(\sum_i (1/N)(1/N) = 1/N\), and multiplying by \(N\) normalises this to a constant. Third, only the \(P\) vector is differentiable as written (\(f_i\) contains a non-differentiable \(\arg\max\)), but the gradient through \(P\) alone is enough to shape the router. Switch sweeps \(\alpha\) from \(10^{-1}\) down to \(10^{-5}\) and settles on \(\alpha = 10^{-2}\) as small enough not to overwhelm the cross-entropy signal but large enough to keep the load balanced [src_024]. The Mixtral, GShard, and ST-MoE family inherits a structurally similar formulation [src_025, src_026].
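The loss is short enough to write out in full. Below is a minimal sketch, assuming top-1 assignment as in Switch; the function name and the toy inputs are ours.

```python
import torch
import torch.nn.functional as F

def switch_aux_loss(router_logits: torch.Tensor, alpha: float = 1e-2) -> torch.Tensor:
    """Switch-style load-balancing loss  alpha * N * sum_i f_i * P_i  (top-1 routing assumed).
    router_logits: (tokens, N) raw router scores for one batch."""
    num_tokens, num_experts = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)              # p_i(x) for every expert
    assignment = probs.argmax(dim=-1)                     # where tokens actually go (non-differentiable)
    f = torch.bincount(assignment, minlength=num_experts).float() / num_tokens
    P = probs.mean(dim=0)                                 # average router mass per expert (differentiable)
    return alpha * num_experts * torch.dot(f, P)

# Roughly balanced logits give a value near alpha; collapsed routing gives roughly alpha * N.
print(switch_aux_loss(torch.randn(1024, 64)))                               # ~0.01
print(switch_aux_loss(torch.tensor([10.0] + [0.0] * 63).repeat(1024, 1)))   # ~0.64
```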
💡 Key result
Adding the dot-product penalty \(\alpha N \sum_i f_i P_i\) to cross-entropy is the canonical industrial cure for routing collapse: it pushes the router toward uniform load while the cross-entropy gradient still shapes which expert handles which token.
📋 Recap
- Complete. The auxiliary loss is \(\mathcal{L}_{\mathrm{aux}} = \alpha \cdot N \cdot \sum_i \_\_\_ \cdot \_\_\_\), where the two vectors are the empirical ___-of-tokens-per-expert and the average router ___-mass-per-expert.
- Explain. Why does the loss's minimum sit at \(f_i = P_i = 1/N\)? Why is one of the two vectors not differentiable, and why doesn't that block the gradient from doing useful work?
- Predict. A team uses an auxiliary-loss coefficient \(\alpha\) so large that the routing is perfectly uniform but downstream cross-entropy stops improving. Which of the two objectives is the loss favouring, and what should they change?
5. DeepSeek-V3's auxiliary-loss-free trick
The auxiliary loss works, in the sense that it produces balanced routing. The objection DeepSeek-V3 raises against it is that an auxiliary loss strong enough to balance the load also interferes with the cross-entropy gradient on the actual language-modelling task: the router is being asked simultaneously to produce good routes and to distribute the routing budget evenly, and these two objectives can fight [src_031]. The proposed fix, due to Wang et al. and adopted by DeepSeek-V3, is to take load balance out of the gradient entirely.
A notational shift accompanies this fix. §2's gate was a softmax over the top-\(k\) logits of \(W_g x\), a coupled distribution where the score for one expert depends on the scores for the others. The DeepSeek-V3 formulation below instead uses an independent sigmoid per expert on the affinity \(u_t^T e_i\), so each expert gets its own scalar score in \([0, 1]\) and the top-\(k\) is taken on those scalars. The two are different gate-function families with the same routing semantics (top-\(k\) selection followed by a weighted sum), but the per-expert independence of the sigmoid is what makes the bias-update trick clean to implement: shifting \(b_i\) on one expert moves only that expert's score relative to the others.
DeepSeek-V3's MoE block computes a per-token affinity score \(s_{i,t} = \mathrm{Sigmoid}(u_t^T e_i)\) for each routed expert \(i\), where \(e_i\) is the expert's centroid vector (a learned per-expert vector that the router uses as a query target) and \(u_t\) is the layer's input for token \(t\) [src_031]. To balance load without a loss term, DeepSeek-V3 adds a per-expert bias \(b_i\) to the affinity score at routing time only:

\[
s'_{i,t} = s_{i,t} + b_i .
\]
Top-\(k\) selection is performed against this biased score, but the gate value that actually weights the expert's output in the residual sum is still derived from the unbiased \(s_{i,t}\) [src_031]. So the bias steers the routing without distorting the contribution of any selected expert.
The biases are not learned by gradient descent. An external controller monitors per-expert load over the whole batch at each training step. At the end of each step, \(b_i\) is decreased by a constant \(\gamma\) (the bias update speed) for every overloaded expert and increased by \(\gamma\) for every underloaded one [src_031].
🎯 Intuition
The bias is a thermostat. When expert \(i\) runs hot (overloaded this batch), nudge its bias down so the router prefers it less next batch. When it runs cold, nudge it up. The corrective signal lives entirely in the bias state, never in the cross-entropy gradient, so balancing and language modelling stop fighting each other.
DeepSeek-V3 sets \(\gamma = 0.001\) for the first 14.3T tokens and zeroes it out for the last 500B tokens of pretraining [src_031]. The same paper retains a complementary sequence-wise balance loss with an extremely small \(\alpha\) to prevent intra-sequence pathologies, but the heavy lifting on batch-level balance is done by the bias update [src_031].
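A toy sketch of the thermostat loop follows. The class name, the over/underload test (a comparison against the mean load), and the omission of DeepSeek-V3's gate renormalisation are simplifications of ours; the real controller tracks load across the whole expert-parallel batch.

```python
import torch

class BiasThermostat:
    """Toy auxiliary-loss-free balancer: per-expert biases nudged at the end of every step."""
    def __init__(self, num_experts: int, gamma: float = 0.001):
        self.bias = torch.zeros(num_experts)    # b_i: added to affinities for selection only
        self.gamma = gamma                      # bias update speed

    def select(self, affinities: torch.Tensor, k: int):
        """affinities: (tokens, E) unbiased sigmoid scores s_{i,t}."""
        biased = affinities + self.bias                    # bias steers which experts are chosen...
        top_idx = biased.topk(k, dim=-1).indices
        gates = torch.gather(affinities, -1, top_idx)      # ...but gate values stay unbiased
        return top_idx, gates

    def update(self, top_idx: torch.Tensor) -> None:
        """End-of-step update: cool down overloaded experts, warm up underloaded ones."""
        load = torch.bincount(top_idx.flatten(), minlength=self.bias.numel()).float()
        self.bias -= self.gamma * (load > load.mean()).float()   # overloaded -> less preferred next step
        self.bias += self.gamma * (load < load.mean()).float()   # underloaded -> more preferred next step

balancer = BiasThermostat(num_experts=16)
scores = torch.sigmoid(torch.randn(1024, 16))   # stand-in for s_{i,t}
idx, gates = balancer.select(scores, k=2)
balancer.update(idx)
print(balancer.bias)                            # small +/- gamma nudges after one step
```

Nothing here ever enters the loss, which is the whole point: the balancing pressure lives in the bias state rather than in the gradient.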
The mechanism is elegant for the same reason batch normalisation was: the corrective signal is computed from a moving estimate of a batch-level statistic, so the policy adjusts toward balance without any of the imbalance leaking into the cross-entropy gradient. DeepSeek-V3 reports stable training over 14.8T tokens with no irrecoverable loss spikes and no rollbacks, and credits the result jointly to MLA, auxiliary-loss-free MoE, FP8 mixed precision, and the DualPipe schedule [src_031]. The contribution this section credits is specifically the load-balancing scheme; the other three components are surveyed in Chapter 8 of this book.
💡 Key result
Load balance can be enforced by an external thermostat on per-expert biases, so cross-entropy gradients carry no balancing burden, and the resulting training is stable at frontier scale.
🔗 Connection
MLA, FP8 mixed precision, and DualPipe (the three other DeepSeek-V3 components named in this section) are surveyed alongside DeepSeek-V3's other architecture deltas in Chapter 8 (Inside a Modern Decoder-Only LLM).
6. Expert capacity and dropped tokens
A subtlety the routing equation hides is that production MoE implementations often need fixed-shape tensors. If expert \(i\) ends up processing 137 tokens this batch and the kernel was compiled for 128, the system has to either pad or drop. The standard solution is to declare a per-expert capacity in advance and have the router cooperate.
🎯 Intuition
Imagine you compiled a kernel that processes exactly 32 tokens per expert. Until the router cooperates, expert \(i\) might receive 47 tokens this batch (overflow: the extra 15 must go somewhere) or 19 (underflow: pad 13 slots with zeros). Both are bad: overflow means dropped work, underflow means wasted memory and FLOPs. The capacity factor is the relaxation of that fixed slot, a small over-allocation that buys tolerance to imbalance at a controlled memory cost.
Switch Transformer formalises this by setting expert capacity as

\[
\mathrm{capacity} = \left\lceil c \cdot \frac{\text{tokens per batch}}{\text{number of experts}} \right\rceil,
\]

where \(c\) is the capacity factor, a hyperparameter typically slightly greater than \(1.0\) [src_024]. A capacity factor above \(1.0\) provides a buffer for routing imbalance at the cost of memory and compute on padded slots. When more than \(\mathrm{capacity}\) tokens want to route to expert \(i\), the overflow tokens are dropped: their FFN output for that layer is simply replaced by their own residual passthrough (the token bypasses the MoE block and re-enters the residual stream unchanged), as if the MoE block were the identity for them [src_024]. Switch reports drop rates below 1% in practice when the auxiliary loss has done its job [src_024].
🤔 Pause and reflect
Set \(c = 1.0\) and assume routing is moderately imbalanced, say the most-loaded expert this batch receives 1.4× its share. Predict, before reading on: roughly what fraction of tokens get dropped, and what does pushing \(c\) to \(1.5\) buy in exchange for what cost? (Answer in your head before turning to the next paragraph.)
Capacity factor is itself a knob with two-sided cost. A higher \(c\) is more forgiving of imbalance and reduces drop rates but inflates memory and compute. A lower \(c\) is more efficient but more brittle. Switch's experiments use values in the range 1.0 to 2.0, with 1.25 a common default in the MoE survey's catalogue [src_024, src_026]. DeepSeek-V3 sidesteps the question entirely: the auxiliary-loss-free balancing is good enough that no tokens are dropped during training or inference, and the implementation does not need a capacity buffer [src_031].
The short code sketch below shows the capacity arithmetic explicitly.
```python
import math

def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float = 1.25) -> int:
    """Static per-expert slot count used at compile time."""
    return math.ceil(capacity_factor * tokens_per_batch / num_experts)

# Switch-style example: 4096 tokens per batch, 128 experts, c=1.25
# -> each expert is sized for ceil(1.25 * 4096 / 128) = 40 tokens
print(expert_capacity(4096, 128, 1.25))  # 40
```
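To see what the drop-and-passthrough policy does to overflow tokens, here is a deliberately tiny sketch in the same style; the placeholder expert function and the all-to-one routing pattern are ours, chosen only to make the overflow visible.

```python
def moe_with_capacity(tokens, routes, num_experts, capacity, expert_fn):
    """Toy drop-tokens policy: the first `capacity` tokens per expert are processed,
    the rest pass through unchanged (the MoE block acts as the identity for them)."""
    out, used = [], [0] * num_experts
    for tok, e in zip(tokens, routes):
        if used[e] < capacity:
            used[e] += 1
            out.append(tok + expert_fn(tok, e))   # expert output added back into the residual stream
        else:
            out.append(tok)                       # dropped: residual passthrough only
    return out

# 8 tokens all routed to expert 0 with capacity 3 -> the last 5 bypass the block entirely.
print(moe_with_capacity(list(range(8)), [0] * 8, num_experts=2, capacity=3,
                        expert_fn=lambda t, e: 100))   # [100, 101, 102, 3, 4, 5, 6, 7]
```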
📋 Recap
- Complete. Switch's capacity formula is \(\mathrm{capacity} = \lceil c \cdot \_\_\_ / \_\_\_ \rceil\), where \(c\) is the ___-factor.
- Explain. Why do MoE kernels need a per-expert capacity declared in advance? What happens to overflow tokens, and why does that mean Β§4's auxiliary loss is doing systems work as well as quality work?
- Compare. Switch sets \(c \approx 1.25\) and tolerates a small drop rate. DeepSeek-V3's auxiliary-loss-free balancing keeps the drop rate at zero. Which problem does each architecture choose to live with, and which does each refuse?
7. Expert parallelism: the systems picture
At training and inference scale, the \(E\) experts in an MoE layer do not fit on one device. Expert parallelism shards the experts across GPUs (or, at larger scale, across nodes), and each token must be communicated to whichever device hosts the expert(s) it was routed to [src_024, src_025, src_026]. The standard collective for this is all-to-all: every device sends a slab of tokens to every other device, in proportion to how many tokens it has routed there, then every device sends the expert outputs back along the reverse path. (For readers whose distributed-training background is shallow: all-to-all is distinct from all-reduce, which sums tensors across devices, and from all-gather, which concatenates them; all-to-all permutes tokens across devices, and is the natural collective when the destination of each token depends on a per-token routing decision.) All-to-all latency is the systems cost MoE pays for parameter efficiency.
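Before turning to the engineering consequences, here is a single-process toy of what the dispatch half of that all-to-all computes. The device count, expert placement, and top-1 routing below are illustrative assumptions; a real system performs the permutation with a collective rather than with Python lists.

```python
import random

NUM_DEVICES, EXPERTS_PER_DEVICE, TOKENS_PER_DEVICE = 4, 2, 16
NUM_EXPERTS = NUM_DEVICES * EXPERTS_PER_DEVICE

random.seed(0)
# Routing decision already made: which expert each token on each device goes to (top-1 here).
routes = [[random.randrange(NUM_EXPERTS) for _ in range(TOKENS_PER_DEVICE)]
          for _ in range(NUM_DEVICES)]

# Dispatch plan: how many tokens device `src` must send to device `dst`.
send_counts = [[0] * NUM_DEVICES for _ in range(NUM_DEVICES)]
for src, token_routes in enumerate(routes):
    for expert in token_routes:
        dst = expert // EXPERTS_PER_DEVICE        # the device that hosts this expert
        send_counts[src][dst] += 1

for src in range(NUM_DEVICES):
    print(f"device {src} sends {send_counts[src]} tokens to devices 0..{NUM_DEVICES - 1}")

# Tokens received per device = that device's expert workload; the busiest device sets the step time.
recv = [sum(send_counts[src][dst] for src in range(NUM_DEVICES)) for dst in range(NUM_DEVICES)]
print("tokens received per device:", recv, "-> slowest device processes", max(recv))
```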
Three engineering consequences follow. First, load balance is not just a quality issue; it is a wall-clock issue. An underloaded expert means the GPU that hosts it sits idle while peers finish, and the slowest GPU sets the step time. Mixtral specifically calls out that expert parallelism introduces load-balance pressure on the engineering side, not only on the modelling side [src_025]. Second, all-to-all bandwidth scales with the number of nodes involved, so cross-node MoE training amplifies communication costs rapidly with cluster size. DeepSeek-V3 introduces node-limited routing to bound this: each token reaches at most \(M\) nodes, where \(M = 4\) in the DeepSeek-V3 configuration, with the constraint that the top-\(M\) nodes are chosen by summed affinity score across each node's experts [src_031]. Third, the schedule of when to do expert all-to-all matters: DeepSeek-V3's DualPipe pipeline-parallelism algorithm is built specifically to overlap forward and backward computation with the MoE all-to-all, making the cross-node MoE communication cost nearly hideable [src_031].
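The node-limited rule in the second consequence can be sketched as a masked top-\(k\). The function below follows this section's description (rank nodes by summed affinity, then take the top-\(k\) among experts on the surviving nodes); the name, shapes, and the exact per-node score are our simplifications rather than the paper's implementation.

```python
import torch

def node_limited_topk(scores: torch.Tensor, experts_per_node: int,
                      max_nodes: int, k: int) -> torch.Tensor:
    """Top-k expert indices for one token, restricted to the max_nodes best nodes."""
    num_nodes = scores.numel() // experts_per_node
    per_node = scores.view(num_nodes, experts_per_node)
    node_score = per_node.sum(dim=-1)                    # rank nodes by summed expert affinity
    keep = node_score.topk(max_nodes).indices            # the <= M nodes this token may reach
    masked = torch.full_like(per_node, float("-inf"))
    masked[keep] = per_node[keep]                        # expose only experts on the kept nodes
    return masked.view(-1).topk(k).indices               # top-k among the allowed experts

# Toy example: 256 routed experts spread over 8 nodes, top-8 routing, at most 4 nodes per token.
affinity = torch.sigmoid(torch.randn(256))
print(node_limited_topk(affinity, experts_per_node=32, max_nodes=4, k=8))
```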
The Ultra-Scale Playbook from Hugging Face's Nanotron team discusses expert parallelism alongside data, tensor, sequence, and pipeline parallelism as one of the parallelism axes used to ship modern frontier models, and frames the engineering picture in those terms [src_007]. Stanford CS336 covers the same material at lecture pace [src_004]. Grigorov's Building LLMs from Scratch (Apress, 2026) treats the engineering / PyTorch / CUDA-kernel angle on expert sharding as a complementary perspective to the algorithmic treatment we give here [src_047].
🔗 Connection
The full systems-level treatment of MoE, covering parallelism stacks (data, tensor, sequence, pipeline, expert), fused kernels, FP8 mixed precision, and the all-to-all schedule, sits outside this chapter's algorithmic scope. The three references above (the Ultra-Scale Playbook, Stanford CS336, and Grigorov 2026) are the natural next reads for that material.
8. Total versus active parameters
Two parameter counts matter for an MoE model, and the difference between them is the architecture's headline value proposition.
Total parameters is what the model has on disk and in GPU memory. It scales with \(E\) and dominates serving cost: even though only \(k\) of \(E\) experts process any given token, all \(E\) are loaded so any of them can be selected on the next token. Active parameters is what processes a single token, scales with \(k\) (and with the size of any always-on shared expert), and dominates per-token FLOPs and inference latency at small batch sizes.
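A back-of-the-envelope sketch shows how the two counts diverge, counting only the FFN experts under a SwiGLU-style \(3 \cdot D \cdot d_{\mathrm{ff}}\) expert size. The dimensions are Mixtral-flavoured assumptions of ours; attention, embeddings, and the tiny router are ignored.

```python
def moe_ffn_params(d_model: int, d_ff: int, num_experts: int, k: int,
                   shared_experts: int = 0) -> tuple[int, int]:
    """Per-layer (total, active) FFN parameter counts for a top-k MoE block."""
    per_expert = 3 * d_model * d_ff                          # SwiGLU-style expert: three D x d_ff matrices
    total = (num_experts + shared_experts) * per_expert      # everything that must sit in memory
    active = (k + shared_experts) * per_expert               # what one token actually runs through
    return total, active

# Mixtral-flavoured toy numbers: d_model=4096, d_ff=14336, 8 experts, top-2, 32 layers.
total, active = moe_ffn_params(4096, 14336, num_experts=8, k=2)
print(f"per layer : total {total / 1e9:.2f}B, active {active / 1e9:.2f}B")
print(f"x32 layers: total {32 * total / 1e9:.1f}B, active {32 * active / 1e9:.1f}B")
# Roughly 45B vs 11B from the FFN experts alone; attention and embeddings account for
# the remainder of the 47B-total / 13B-active headline numbers in the table below.
```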
The empirical comparison Mixtral draws is the cleanest example of the trade-off [src_025]:
| Model | Total params | Active params per token | Notes |
|---|---|---|---|
| Llama-2 70B (dense) | 70B | 70B | every parameter touched per token |
| Mixtral 8x7B | 47B | 13B | 8 experts, top-2 routing |
| DeepSeek-V3 | 671B | 37B | 256 routed + 1 shared, top-8 |
🤔 Pause and reflect
Looking only at the table: will Mixtral 8x7B be cheaper or more expensive to serve than Llama-2 70B? Decide separately for per-token compute and for total memory footprint, and only after committing to those answers, read the next paragraph.
Mixtral matches or exceeds Llama-2 70B across most benchmarks while using roughly 5x fewer active parameters during inference [src_025]. The memory cost to serve Mixtral is set by the 47B total, which is still smaller than Llama-2 70B's full 70B. DeepSeek-V3 pushes the same lever harder, multiplying total capacity by an order of magnitude over Llama-2 70B while keeping active parameters comparable [src_031].
The trade-off has a sharp edge. The smart comparison is active parameters for compute, total parameters for capability [src_025, src_026]: a 13B-active model is priced like a 13B dense model on per-token FLOPs and per-token KV-cache, but it has access to a 47B-parameter representation when it matters. The qualifier when it matters is doing real work: only the right two of eight experts are queried, so the gating network has to actually learn to specialise. Mixtral's own routing analysis on The Pile validation set finds that experts do not specialise by topic in the human-readable sense (philosophy texts and arXiv papers are routed similarly), but they do specialise on syntactic patterns and exhibit positional locality, with consecutive tokens often routed to the same experts at higher layers [src_025].
💡 Key result
Active parameters set per-token compute, total parameters set the capability ceiling; these decouple in MoE in a way they cannot in a dense Transformer.
9. Why MoE became default at the open frontier
Three things had to be true at once for MoE to take over.
Inference compute per token stays bounded as you grow capacity. This is the architectural fact. Over a deployed model's lifetime, inference is run orders of magnitude more often than training. If serving cost dominates training cost over a model's lifetime, and if active parameters drive serving cost, then growing total parameters at fixed active parameters is a Pareto improvement on dense [src_025, src_031]. The same logic that drove Llama-3 to over-train its smaller variants past Chinchilla optimum (Chapter 9) drives MoE adoption here: in production, inference is amortised over many more tokens than training is.
🔗 Connection
The over-training regime introduced in Chapter 9 (Scaling Laws), §6 (Post-Chinchilla), supplies the inference-cost-amortisation argument that this bullet invokes.
Open-weights MoE training caught up to dense. The MoE survey timeline shows the curve: Mixtral-8x7B (Jan 2024), DBRX, Grok-1, DeepSeek-V2, Qwen1.5-MoE, Hunyuan-Large, DeepSeek-V3, all open-weights MoE LLMs released through 2024 and into 2025 [src_026]. By the end of that wave the gap to closed-source flagship models had been substantially closed; DeepSeek-V3 in particular reports performance comparable to GPT-4o and Claude-3.5-Sonnet on a range of standard and open-ended benchmarks [src_031]. The Ultra-Scale Playbook frames the systems toolchain (parallelism stacks, fused kernels, FP8 mixed precision, expert-parallel all-to-all) as the maturing infrastructure that made this possible [src_007].
Auxiliary-loss-free balancing closed the training-stability gap. Switch Transformer had identified instability and load-balance interference as twin obstacles back in 2021 [src_024]. The DeepSeek-V3 contribution showed that the load-balance objection could be addressed without compromising the cross-entropy gradient, and that the resulting training was stable over 14.8T tokens with no rollbacks [src_031]. That is a concrete falsification of the long-standing claim that MoE was inherently more brittle than dense at frontier scale.
⚠️ Pitfall
MoE serving is harder than dense serving: total-parameter footprint sets memory pressure, expert-parallel routing tolerates batched workloads better than interactive single-token ones, and Llama-3's dense choice for the 405B flagship is the live counter-example.
The honest caveat: MoE serving is harder than dense serving. The total-parameter footprint is what determines memory pressure, expert-parallel routing introduces overhead that batched workloads tolerate better than single-token interactive ones, and the Mixtral paper specifically notes that MoE layers are best suited to batched serving where the per-batch routing cost amortises into high arithmetic intensity [src_025]. Llama-3 explicitly chose dense over MoE for the 405B flagship, citing training stability [src_026]. So MoE is the default at the open frontier as of 2025, but it is not a foregone conclusion: the choice depends on the team's risk tolerance, the deployment profile, and how much engineering investment in the all-to-all path the project can afford. The Ultra-Scale Playbook treats both architectures as live options inside the same parallelism stack [src_007].
📋 Recap
- Complete. Three things had to be true at once for MoE to take over: ___-compute-per-token stays bounded as capacity grows; open-weights MoE training caught up to dense; and ___-balancing closed the training-stability gap.
- Compare. Why does the same Llama-3 team that pushed over-training past Chinchilla optimum (Ch.9) choose dense for the 405B flagship rather than MoE? What is each choice optimising?
- Predict. A team has chosen MoE for an open-weights frontier model and a chat product. Their main deployment profile is interactive single-token serving. Which of the three "had-to-be-true" enablers is their weakest, and what should they invest in to compensate?
10. Closing
This chapter sits at the seam between architecture (Part 4) and post-training (Part 6). DeepSeek-V3 reappears in Chapter 8's architecture-comparison table as the canonical fine-grained-MoE frontier model, and again in Chapter 13 on reasoning models because DeepSeek-R1 inherits the V3 base and applies reinforcement learning with verifiable rewards on top of it. The MoE machinery developed here is the common substrate for both threads.
🔗 Connection
DeepSeek-V3's MoE base reappears in two later chapters: as a frontier-architecture entry in Chapter 8 (Inside a Modern Decoder-Only LLM), and as the pretrained substrate for DeepSeek-R1 in Chapter 13 (Reasoning Models).
What you should take away: an MoE block is the same FFN slot a dense Transformer has, parameterised differently, plus a learned router, plus a load-balancing strategy. The router equation \(y = \sum_{i \in \mathrm{TopK}} g_i \cdot E_i(x)\) is mechanically simple. The load-balancing strategy is what made the architecture industrially viable, and the move from auxiliary-loss to auxiliary-loss-free balancing is what made it competitive with dense at the frontier. The systems cost is real but pays back at deployment.
References
- [src_004] Hashimoto, T., & Liang, P. (2025). Stanford CS336: Language Modeling from Scratch (Spring 2025). https://stanford-cs336.github.io/spring2025/
- [src_007] Hugging Face. (2025). The Ultra-Scale Playbook: Training LLMs on GPU Clusters. https://huggingface.co/spaces/nanotron/ultrascale-playbook
- [src_024] Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. https://arxiv.org/pdf/2101.03961
- [src_025] Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., et al. (2024). Mixtral of Experts. https://arxiv.org/pdf/2401.04088
- [src_026] Cai, W., Jiang, J., Wang, F., Tang, J., Kim, S., & Huang, J. (2024). A Survey on Mixture of Experts. https://arxiv.org/pdf/2407.06204
- [src_031] DeepSeek-AI. (2024). DeepSeek-V3 Technical Report. https://arxiv.org/pdf/2412.19437
- [src_047] Grigorov, D. (2026). Building Large Language Models from Scratch (Apress). https://doi.org/10.1007/979-8-8688-2297-1