
Rotary Position Encoding

1. Why position has to be added at all

Self-attention, written down honestly, has no idea where in a sequence any of its tokens live. The dot products \(Q K^{\top}\) depend only on the contents of the query and key vectors, not on the indices that produced them, so a Transformer that takes a sequence as input and produces a sequence as output is permutation-equivariant in the absence of any auxiliary signal: shuffle the tokens, and the outputs are shuffled the same way [src_017, src_002]. For language modelling, this is a fatal property. The sentence "the cat sat on the mat" and the sentence "the mat sat on the cat" are made of the same tokens; the difference between them lives entirely in their order. Some mechanism has to inject position information into the computation, and the question is which mechanism.

Two broad strategies have been tried [src_017]. The first is additive: compute a position vector \(p_m\) for each absolute index \(m\), and add it to the token embedding before the first block. Vaswani et al.'s sinusoidal scheme and BERT's learned embedding table both belong to this family [src_017]. The second is multiplicative: leave the value pathway alone and modify the query and key vectors so that the dot product \(\langle q_m, k_n \rangle\) encodes the relative offset \(n - m\) directly. RoPE belongs to this second family, and it is the multiplicative idea, composed via rotation in feature space, that has displaced both additive variants from the frontier of decoder-only large language models in 2024 to 2026 [src_002, src_010, src_047].

This chapter follows the same notation table introduced in Chapter 1: \(B\) is the batch size, \(T\) the sequence length, \(D\) the residual-stream dimension, \(h\) the number of attention heads, and \(d_h = D / h\) the per-head dimension. RoPE is applied per head, and within each head it acts on the per-head query and key vectors of dimension \(d_h\). We will use \(d\) for the dimension being rotated when we want to be neutral about whether the discussion refers to per-head or full-model dimensions; in practice the rotated object is the per-head query or key, so \(d = d_h\) throughout the implementation discussion of Section 8.

2. The pre-RoPE landscape and why both predecessors are gone

The 2017 Transformer used sinusoidal absolute positional encoding: each absolute index \(m\) is mapped to a fixed vector whose \(2t\)-th and \((2t{+}1)\)-th components are \(\sin(m / 10000^{2t/d})\) and \(\cos(m / 10000^{2t/d})\) respectively, and that vector is added to the token embedding before the first block [src_017]. The geometric progression of frequencies, from a wavelength of \(2\pi\) at the most rapid coordinate to roughly \(10000 \cdot 2\pi\) at the slowest, was chosen because the authors hypothesised that the model could learn to attend to relative offsets through the linear-shift property of sinusoids [src_017]. The hope was that sinusoidal encodings would also extrapolate cleanly to sequence lengths beyond the training distribution.
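
A minimal sketch of the sinusoidal table makes the formula concrete (the helper name is ours, and an even \(d\) is assumed): even-indexed components hold the sines, odd-indexed components the cosines, and the resulting rows are simply added to the token embeddings.

import torch

def sinusoidal_table(T_max: int, d: int) -> torch.Tensor:
    """Sinusoidal absolute position table of shape (T_max, d); illustrative sketch."""
    m = torch.arange(T_max, dtype=torch.float32).unsqueeze(1)   # positions, shape (T_max, 1)
    two_t = torch.arange(0, d, 2, dtype=torch.float32)          # even component indices 2t
    inv_freq = 10000.0 ** (-two_t / d)                          # 1 / 10000^{2t/d}, shape (d/2,)
    pe = torch.zeros(T_max, d)
    pe[:, 0::2] = torch.sin(m * inv_freq)                       # 2t-th components
    pe[:, 1::2] = torch.cos(m * inv_freq)                       # (2t+1)-th components
    return pe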

In practice, the extrapolation hope did not hold up. Sinusoidal encodings degrade outside the training-length window in ways that cannot be patched without retraining, and the encoder-only family that followed the original Transformer adopted a different scheme: a learned absolute position embedding, in which \(p_m\) is a row of a trainable lookup table of shape \(T_{\max} \times D\) [src_017, src_002]. BERT and its successors used this convention, and it has the obvious limitation that the table only contains rows for positions seen in pre-training; any position beyond \(T_{\max}\) has no representation at all [src_002].

A third family, relative position biases, attempted to fix the asymmetry by injecting position information directly into the attention logits rather than into the token embedding. T5 buckets the offset \(i - j\) into a small set of learned scalar shifts that are added to the pre-softmax logit; ALiBi adds a linear penalty \(-\beta \cdot (i - j)\) with per-head slopes \(\beta_k = 1 / 2^{8k / n_{\text{head}}}\) [src_002]. Both schemes yield genuine relative-position behaviour and both extrapolate beyond the training-length window in a way that absolute schemes do not. Neither, however, achieves the property RoPE makes available almost for free: a multiplicative encoding that preserves vector norms, composes cleanly with linear attention, and induces a long-term decay structure that matches the intuition that distant tokens should interact less [src_017].
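
To make the additive-bias idea concrete, the sketch below builds an ALiBi-style logit bias under the slope formula quoted above (the function name and the causal clamp at zero are our choices, not part of a published recipe); the tensor is added to the pre-softmax logits and the rest of attention is unchanged.

import torch

def alibi_bias(T: int, n_head: int) -> torch.Tensor:
    """ALiBi-style additive bias of shape (n_head, T, T): -beta_k * (i - j) for key j <= query i."""
    k = torch.arange(1, n_head + 1, dtype=torch.float32)
    slopes = 2.0 ** (-8.0 * k / n_head)                  # beta_k = 1 / 2^{8k / n_head}
    i = torch.arange(T).view(T, 1)                       # query positions
    j = torch.arange(T).view(1, T)                       # key positions
    dist = (i - j).clamp(min=0).float()                  # causal distance i - j; zero above the diagonal
    return -slopes.view(n_head, 1, 1) * dist             # broadcast per-head slopes over the (T, T) grid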

By 2024 the frontier of open-weight decoder-only language models had converged on RoPE. Llama, Qwen, Gemma, and DeepSeek all use it; ModernBERT, the encoder-only revival, uses it too [src_002, src_010, src_047]. Sinusoidal absolute and learned absolute encodings survive in the historical record but have effectively disappeared from new decoder-only architectures.

3. RoPE in two dimensions: the complex-plane derivation

🎯 Intuition

RoPE rotates the query and key vectors as a function of position: it literally turns each vector by an angle proportional to its position, like the hands of a clock that tick once per token. When two rotated vectors meet inside a dot product, the absolute angles cancel and only the relative offset survives. The rest of this section turns that picture into algebra; the algebra is the consequence, not the goal.

The cleanest way to see why RoPE works is to start in two dimensions and use the complex plane. Suppose the per-head dimension is \(d = 2\), and let \(W_q, W_k \in \mathbb{R}^{2 \times 2}\) be the query and key projection matrices. The goal Su et al. set up is to find functions \(f_q\) and \(f_k\) such that the inner product of the position-encoded query at index \(m\) and the position-encoded key at index \(n\) depends only on the contents of the embeddings and the relative offset \(n - m\) [src_017]:

\[ \langle f_q(x_m, m), f_k(x_n, n) \rangle = g(x_m, x_n, n - m). \]

Identify the 2-vector \(W_q x_m \in \mathbb{R}^2\) with the complex number \(z_m^{(q)} = (W_q x_m)_1 + i \cdot (W_q x_m)_2\), and likewise for \(W_k x_n\). Multiplication by \(e^{i m \theta}\) in the complex plane is exactly a counter-clockwise rotation by angle \(m \theta\). Uniqueness here is not magic. The position-encoded forms must satisfy the relative-distance constraint and reduce to \(W_q x_m\), \(W_k x_n\) at \(m = n = 0\) (the no-position-encoding initial condition). Su et al. show that any function pair satisfying both is forced into the form \(z \, e^{i m \theta}\) up to a phase that the projection matrices can absorb. The full argument, a functional equation in the position variable, lives in Section 3.4.1 of the RoFormer paper; we record only that the constraint plus the initial condition pins the form. The result is that the position-encoded query and key are [src_017, src_002]:

\[ f_q(x_m, m) = (W_q x_m) \cdot e^{i m \theta}, \qquad f_k(x_n, n) = (W_k x_n) \cdot e^{i n \theta}. \]

The inner product, computed as the real part of \(f_q(x_m, m) \cdot \overline{f_k(x_n, n)}\), then depends on the positions only through their difference [src_017]:

\[ \mathrm{Re}\bigl[(W_q x_m)(W_k x_n)^{*} e^{i (m - n) \theta}\bigr] = g(x_m, x_n, n - m). \]

The mechanism is therefore a rotation: the query at position \(m\) is rotated by \(m \theta\), the key at position \(n\) is rotated by \(n \theta\), and when the two rotated vectors are dot-producted together, the rotations partially cancel and leave behind only the relative-offset rotation \((m - n) \theta\) [src_017].

Engineers prefer to write the same operation as a matrix multiplication. The complex-plane multiplication \(z \mapsto z \cdot e^{i m \theta}\) is identical to the real-valued \(2 \times 2\) matrix-vector product

\[ R_{m \theta} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}, \qquad R_{m \theta} (W_q x_m) = (W_q x_m) \cdot e^{i m \theta}. \]

Both forms compute the same thing; the complex form makes the relative-distance proof transparent in two lines, while the real form is what an implementation actually produces in a vectorised kernel. We will use the complex form whenever a derivation is at stake and the real form whenever an implementation is at stake.
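
A short numerical check with arbitrary toy values (the numbers below are ours) confirms that the complex and real forms agree.

import math
import torch

m, theta = 5, 0.3
q = torch.tensor([0.7, -1.2])                                  # stands in for W_q x_m
z = torch.complex(q[0], q[1]) * torch.polar(torch.tensor(1.0), torch.tensor(m * theta))
c, s = math.cos(m * theta), math.sin(m * theta)
R = torch.tensor([[c, -s], [s, c]])                            # the real-valued R_{m theta}
print(torch.allclose(R @ q, torch.stack([z.real, z.imag])))    # True: rotation matrix == complex multiply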

4. The relative-position property as the punchline

The point of the construction is not the rotation itself; the point is what the rotation does to the inner product. Plug the position-encoded \(q_m\) and \(k_n\) into the attention formula and read off what the dot product depends on [src_017]:

\[ q_m^{\top} k_n = (R_{m\theta} W_q x_m)^{\top} (R_{n\theta} W_k x_n) = x_m^{\top} W_q^{\top} R_{(n - m)\theta} W_k x_n. \]

Three steps are folded into the chained equality. (i) Distribute the transpose: \((R_{m\theta} W_q x_m)^{\top} = x_m^{\top} W_q^{\top} R_{m\theta}^{\top}\). (ii) Use \(R_{m\theta}^{\top} = R_{-m\theta}\) (the rotation is orthogonal, so the transpose equals the inverse). (iii) Compose \(R_{-m\theta} R_{n\theta} = R_{(n - m)\theta}\), which holds because the rotation matrices form a one-parameter abelian subgroup of \(\mathrm{SO}(2)\) [src_017].
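
A short numerical check (arbitrary vectors and angles, chosen only for illustration) confirms the chained identity.

import math
import torch

def rot2(angle: float) -> torch.Tensor:
    """The 2x2 rotation matrix R_angle."""
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]])

theta, m, n = 0.3, 7, 11
q, k = torch.randn(2), torch.randn(2)                  # stand-ins for W_q x_m and W_k x_n
lhs = (rot2(m * theta) @ q) @ (rot2(n * theta) @ k)    # rotate both, then take the dot product
rhs = q @ (rot2((n - m) * theta) @ k)                  # the relative-offset form
print(torch.allclose(lhs, rhs, atol=1e-5))             # True: only n - m enters the result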

⚠️ Pitfall

The relative-position result holds for the inner product, not for the individual rotated vectors. Two queries at positions \(m_1, m_2\) with the same content but different positions still produce different rotated representations \(R_{m_1\theta}(W_q x), R_{m_2\theta}(W_q x)\); only when those representations meet a key inside a dot product do the absolute rotations cancel.

🤔 Pause and reflect

Before reading on: can you state, in one sentence and without looking at the equation, what happens to the inner product \(q_m^{\top} k_n\) when you replace \(m, n\) with \(m+\Delta, n+\Delta\)? Why does the value pathway not appear in the answer? (Do not look ahead; write the answer down or say it out loud.)

Two consequences follow at once. First, the attention logit between the query at position \(m\) and the key at position \(n\) is a function of the embeddings together with the relative distance \(n - m\), exactly as the constraint \(g(x_m, x_n, n - m)\) of Section 3 required [src_017]. Second, this relative-position behaviour is achieved without modifying the value pathway at all: \(V\) is left untouched, and the encoding lives entirely in the query-key inner product [src_017, src_002].

Figure 1 visualises the geometry. The query at position \(m\) and the key at position \(n\) each sit on a circle in the complex plane; their respective rotations by \(m \theta\) and \(n \theta\) leave the radii (the vector norms) invariant, and the angle between them after rotation differs from the angle before rotation by exactly \((n - m) \theta\). The attention logit depends on this rotated angle, not on the absolute angles, so any pair of positions \((m, n)\) with the same offset produces the same positional contribution to the logit.

Figure 1: RoPE rotates the query and key vectors as a function of position; the inner product depends only on the relative offset.

This is the relative-position-property statement that Su et al. proved in Section 3.4 of the RoFormer paper, and it is the property that the rest of the chapter trades on [src_017]. Everything that follows (the generalisation to higher dimensions, the choice of frequencies, the long-term decay, the context-extension tricks, the use in modern open-weight models) is a consequence of preserving this property and choosing the rotation angles well.

💡 Key result

Rotating \(q\) at position \(m\) and \(k\) at position \(n\) leaves their inner product depending only on the relative offset \(n - m\), and the value pathway is untouched.

🔄 Recap

  • Complete: in \(q_m^{\top} k_n = x_m^{\top} W_q^{\top} R_{?} W_k x_n\), what fills the subscript?
  • Explain: why does the value pathway not appear in the relative-position result, even though attention multiplies queries, keys, and values?
  • Predict: if \(\theta = 0\) (no rotation at all), what does the inner product reduce to, and what does that recover?

🔗 Connection

Chapter 4 takes the relative-position property up at implementation level: how RoPE composes with the KV-cache, what FlashAttention does to the rotation kernel, and why GQA/MQA preserve the property unchanged.

5. Generalisation to \(d\) dimensions: a block-diagonal stack of 2D rotations

🎯 Intuition

We pair feature dimensions because \(\mathrm{SO}(2)\), the group of 2D rotations, is the simplest abelian rotation group, and that commutativity is what made the relative-position cancellation work in two dimensions. Three-dimensional rotations (\(\mathrm{SO}(3)\)) do not commute, and the cancellation breaks. Pairing is therefore the smallest construction that lets the 2D argument lift cleanly to higher dimensions.

A real attention head has \(d_h\) dimensions, not two, and we need to lift the construction. The trick Su et al. use is straightforward: for an even \(d\), partition the \(d\) feature dimensions into \(d/2\) consecutive pairs, treat each pair as its own complex plane, and apply a 2D rotation in each pair with its own angular frequency \(\theta_j\) [src_017]. Per pair \(j \in \{1, \ldots, d/2\}\), define the complex number

\[ z_m^{(j)} = q_m^{(2j - 1)} + i \cdot q_m^{(2j)}, \]

and rotate it by the position-dependent angle \(m \cdot \theta_j\):

\[ z_m^{(j)} \mapsto z_m^{(j)} \cdot e^{i m \theta_j}. \]

The full position-encoded query is the concatenation of the \(d/2\) rotated pairs, read back into real coordinates [src_017]. Equivalently, the \(d\)-dimensional rotation matrix \(R^{d}_{\Theta, m}\) is block-diagonal, with \(d/2\) blocks of the \(2 \times 2\) rotation form \(R_{m \theta_j}\) stacked along the diagonal [src_017]:

\[ R^{d}_{\Theta, m} = \mathrm{diag}\Bigl(R_{m \theta_1}, R_{m \theta_2}, \ldots, R_{m \theta_{d/2}}\Bigr). \]

Each block rotates only the two coordinates it owns; the blocks do not interact. Because \(R^{d}_{\Theta, m}\) is block-diagonal and orthogonal, it preserves vector norms, and because each block is itself a planar rotation, the per-pair relative-distance argument from Section 4 lifts to the full \(d\)-dimensional inner product [src_017]:

\[ q_m^{\top} k_n = \bigl(R^{d}_{\Theta, m} W_q x_m\bigr)^{\top} \bigl(R^{d}_{\Theta, n} W_k x_n\bigr) = x_m^{\top} W_q^{\top} R^{d}_{\Theta, n - m} W_k x_n. \]

The full \(d\)-dimensional dot product encodes only the relative offset \(n - m\), exactly as in the 2D case [src_017]. The property is inherited from the per-pair construction; there is no additional theorem to prove.
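
The inheritance is easy to verify numerically with a dense version of \(R^{d}_{\Theta, m}\) (illustrative only; as Section 9 shows, real kernels never materialise the block-diagonal matrix).

import torch

def rope_matrix(d: int, m: int, base: float = 10000.0) -> torch.Tensor:
    """Dense block-diagonal R^d_{Theta, m} assembled from d/2 planar rotation blocks."""
    theta = base ** (-2.0 * torch.arange(d // 2, dtype=torch.float64) / d)   # theta_j, 0-indexed bands
    blocks = []
    for a in m * theta:                                  # rotation angle m * theta_j for pair j
        c, s = float(a.cos()), float(a.sin())
        blocks.append(torch.tensor([[c, -s], [s, c]], dtype=torch.float64))
    return torch.block_diag(*blocks)

d, m, n = 8, 3, 10
x_q = torch.randn(d, dtype=torch.float64)
x_k = torch.randn(d, dtype=torch.float64)
q_rot, k_rot = rope_matrix(d, m) @ x_q, rope_matrix(d, n) @ x_k
print(torch.allclose(q_rot @ k_rot, x_q @ (rope_matrix(d, n - m) @ x_k)))   # True: only n - m enters
print(torch.allclose(q_rot.norm(), x_q.norm()))                             # True: orthogonal, norm-preserving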

6. Frequency choice: why \(\theta_j = 10000^{-2(j-1)/d}\)

The remaining design decision is what to set the angular frequencies \(\theta_j\) to. Su et al. choose [src_017]:

\[ \theta_j = 10000^{-2(j - 1)/d}, \qquad j \in \{1, 2, \ldots, d/2\}. \]

🎯 Intuition

For \(d_h = 64\), four representative bands give the concrete handle: \(\theta_1 = 1\) rad/token (wavelength \(\approx 6\) tokens); \(\theta_8 \approx 0.13\) rad/token (wavelength \(\approx 47\) tokens); \(\theta_{16} \approx 0.013\) rad/token (wavelength \(\approx 470\) tokens); \(\theta_{32} \approx 1.3 \times 10^{-4}\) rad/token (wavelength \(\approx 5 \times 10^{4}\) tokens). Fast bands resolve nearby tokens with fine granularity; slow bands distinguish positions thousands of tokens apart.

Read this as a geometric progression. At \(j = 1\), the frequency is \(\theta_1 = 1\) radian per position; at \(j = d/2\), it is \(\theta_{d/2} = 10000^{-(d - 2)/d} \approx 10000^{-1} = 10^{-4}\) radians per position. The first pair therefore completes a full rotation roughly every \(2\pi\) positions, while the last pair takes on the order of \(2\pi \cdot 10^4\) positions to complete a single rotation. The frequency progression is the same one Vaswani et al. used for sinusoidal absolute encodings, with the same base of 10000 and the same geometric spacing [src_017]; what differs is what the frequencies are used for. In RoPE they parameterise rotation angles, not additive sinusoidal components.

The geometric spacing has two virtues that justify the choice empirically and partially theoretically. First, it gives the network access to position information at multiple scales simultaneously: rapidly-rotating pairs resolve nearby tokens at fine granularity, while slowly-rotating pairs distinguish positions that are thousands of tokens apart. Second, with this schedule Su et al. prove a long-term decay property: a quantity bounding the magnitude of the inner-product contribution from a given relative offset \(|m - n|\) falls as \(|m - n|\) grows [src_017]. The decay matches the intuition that two tokens far apart in a sequence should, on average, contribute less to one another's representation than two adjacent tokens, and the matching is not enforced by hand; it falls out of the geometric frequency schedule [src_017]. Figure 1 illustrates the multi-scale character of the construction: the figure shows the rotation through one band, but the full rotary encoding stacks \(d/2\) such rotations, with the bands \(\theta_1, \theta_8, \theta_{16}, \theta_{d/2}\) giving a representative sweep from fast to slow.
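
The representative band values quoted in the Intuition callout can be reproduced directly from the schedule; a quick sketch, using the 1-indexed \(j\) of the formula above:

import torch

d_h = 64
j = torch.arange(1, d_h // 2 + 1, dtype=torch.float64)    # bands j = 1, ..., d_h/2
theta = 10000.0 ** (-2.0 * (j - 1) / d_h)                 # theta_j
wavelength = 2 * torch.pi / theta                         # tokens per full rotation of the pair
print(theta[[0, 7, 15, 31]])                              # ~1, ~0.13, ~0.013, ~1.3e-4 rad/token
print(wavelength[[0, 7, 15, 31]])                         # ~6, ~47, ~470, ~47000 tokens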

🤔 Pause and reflect

Look at the four bands in the Intuition callout above (\(\theta_1, \theta_8, \theta_{16}, \theta_{32}\) for \(d_h = 64\)). For two tokens 100 positions apart, which band's rotation difference \(|m-n| \theta_j\) is closest to a meaningful angle (say, \(\pi/2\)), and which band's rotation difference is too small to register? What does that tell you about which bands carry near-vs-far position information? (Do not look ahead; work it out in your head before continuing.)

A subtle but useful fact is that rotations preserve norms exactly, so the rotary encoding does not change the magnitude of any query or key vector, only the angles in each two-dimensional subspace [src_017]. This norm-preservation is what makes RoPE compatible with linear attention (attention variants whose cost grows linearly in sequence length, taken up in Chapter 4): the positivity-preserving feature maps that linear attention requires are applied first, and the rotation is then applied to the transformed queries and keys without changing their scale [src_017]. In standard softmax attention this is a footnote; in linear-attention designs it is a load-bearing property.

💡 Key result

The geometric frequency schedule produces a long-term decay property: the inner-product contribution from a relative offset \(|m - n|\) falls as \(|m - n|\) grows, so distant tokens interact less, without that decay being put in by hand.

🔗 Connection

Chapter 4 develops linear attention proper: which positivity-preserving feature maps work, what the kernel-trick view gives you, and why RoPE's norm preservation is the property that makes the composition clean.

7. Variants for context extension: position interpolation, NTK-aware, YaRN

A model trained with RoPE up to context length \(T_{\text{train}}\) can be evaluated, in principle, at any longer context length, because the rotation matrix \(R^{d}_{\Theta, m}\) is defined for every integer \(m\). In practice the model's quality degrades sharply once the evaluation length exceeds the training length: positions \(m > T_{\text{train}}\) produce rotation angles that the network never encountered during training, and the attention logits become poorly calibrated in that regime [src_002]. Several techniques have been proposed to extend the usable context window without retraining from scratch.

Linear position interpolation (Chen et al. 2023; the immediate predecessor of NTK-aware scaling) is the simplest. Rescale every input position by the factor \(T_{\text{train}} / T_{\text{eval}}\) before computing the rotation, so that positions in the longer evaluation sequence are mapped back into the trained range [src_002]. The trick is cheap and works at modest extension factors, but it compresses the high-frequency rotation pairs the most severely, blurring the model's ability to resolve nearby tokens precisely [src_002].
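
In code, the whole trick is one rescaling of the position index before the angle computation of Section 9 (a minimal sketch; the function name and the fractional-position convention are ours).

import torch

def interpolated_angles(d_h: int, T_eval: int, T_train: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles m' * theta_j with evaluation positions squeezed back into the trained range."""
    theta = base ** (-torch.arange(0, d_h, 2, dtype=torch.float32) / d_h)
    m = torch.arange(T_eval, dtype=torch.float32) * (T_train / T_eval)    # fractional positions m'
    return torch.outer(m, theta)                                          # feed .cos()/.sin() as in Section 9

Extending a 4k-trained model to 16k, for instance, shrinks every rotation angle by a factor of four; the Intuition callout that follows explains why that hurts the fastest bands most.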

🎯 Intuition

Picture the four bands again. The fastest band already rotates by \(\approx 1\) radian per token; rescaling positions by \(T_{\text{train}}/T_{\text{eval}} = 1/4\) shrinks that to \(\approx 0.25\) radians per token, smearing four originally-distinct positions into one rotation step. The slowest band starts at \(10^{-4}\) rad/token; the same rescaling barely changes its already-glacial rotation. NTK-aware exists exactly to leave the fast bands alone.

NTK-aware scaling addresses this asymmetry by adjusting the rotation base \(b\) (the 10000 in \(\theta_j = b^{-2(j-1)/d}\)) non-uniformly across frequency bands [src_002]. The high-frequency bands are left effectively untouched, so fine-grained position resolution is preserved, while the low-frequency bands are stretched to absorb the extension factor [src_002, src_017]. Reports from practitioners suggest that NTK-aware scaling extends usable context to perhaps 2 to 4 times the training length without serious quality loss [src_002].
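
The fixed variant most often seen in open-source code simply enlarges the base by \(s^{d/(d-2)}\) for an extension factor \(s\); because the NTK-aware proposal itself is not in the citable source pool (see the note below), treat the exponent in this sketch as an assumption rather than a sourced result.

import torch

def ntk_scaled_theta(d_h: int, scale: float, base: float = 10000.0) -> torch.Tensor:
    """'Fixed NTK-aware' frequencies: enlarge the base so slow bands stretch while fast bands stay put."""
    ntk_base = base * scale ** (d_h / (d_h - 2))          # community formulation; treated as an assumption
    return ntk_base ** (-torch.arange(0, d_h, 2, dtype=torch.float32) / d_h)

theta_old = 10000.0 ** (-torch.arange(0, 64, 2, dtype=torch.float32) / 64)
theta_new = ntk_scaled_theta(d_h=64, scale=4.0)
print(theta_new[0] / theta_old[0])      # ~1.00: fastest band untouched
print(theta_new[-1] / theta_old[-1])    # ~0.25: slowest band stretched by roughly the extension factor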

YaRN (Yet another RoPE extensioN) refines NTK-aware scaling by combining the non-uniform frequency rescaling with an attention-temperature adjustment and a chunked piecewise extension schedule [src_002]. The full derivation requires arguments about the rate of change of the rotation across frequency bands and a careful matching of the loss profile after extension; we sketch the idea here and decline to derive it line by line. Readers who want the full treatment should consult the YaRN paper directly and Section 2.3.5.4 of the Xiao and Zhu monograph for a worked exposition [src_002].

ℹ️ Needs review

A primary citation for YaRN itself (Peng et al. 2024, arXiv:2309.00071) is not yet in the source pool of this book; the description above relies on the secondary treatment in [src_002]. A future contributor should attach the YaRN paper and tighten this paragraph. The same applies to the NTK-aware scaling proposal, which originated in a community discussion thread that is not directly citable in the present pool.

🔗 Connection

Context-extension methods (linear interpolation, NTK-aware, YaRN) are pretraining-time and inference-time decisions; Chapter 7 takes them up under the encoder-decoder lens, and Chapter 8 covers their use in modern decoder-only LLMs (Llama-3 long-context, Qwen-3 needle-in-haystack, DeepSeek-V3 long-context claims).

8. RoPE in 2026: where it is actually used

By 2024 to 2026, RoPE has become the default position-encoding scheme of essentially every open-weight decoder-only large language model produced at the frontier. The Llama family (Llama, Llama-2, Llama-3) uses it; Qwen-2 and Qwen-3 use it; Gemma-2 and Gemma-3 use it; DeepSeek-V2 and DeepSeek-V3 use it; the ModernBERT encoder-only revival uses it [src_002, src_010, src_047]. Implementations differ in a few low-level conventions, most importantly the choice between the original Su et al. layout, which pairs adjacent feature coordinates \((x_1, x_2), (x_3, x_4), \ldots\), and the GPT-NeoX / Llama half-strided layout, which pairs \((x_1, x_{d/2 + 1}), (x_2, x_{d/2 + 2}), \ldots\) [src_017, src_010]. The two conventions are equivalent up to a permutation of feature dimensions and produce identical numerics once the projection matrices have absorbed the permutation, but they are not interchangeable at a single weight checkpoint, so practitioners must be careful to match the layout of any pre-trained weights they load.

⚠️ Pitfall

Permutation-equivalent layouts produce identical numerics only after the projection matrices \(W_q, W_k\) have absorbed the permutation. A weight checkpoint trained against the Su layout is not interchangeable with one trained against the GPT-NeoX layout, because \(W_q\) has been fit to a specific feature ordering; loading the wrong layout silently rotates the wrong feature pairs.
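
For intuition only, the permutation relating the two orderings is an even/odd split of the feature axis; the helper below (its name is ours) converts activations laid out in adjacent pairs into the half-strided order, while converting an actual checkpoint means applying the corresponding permutation to the output rows of \(W_q\) and \(W_k\) rather than to activations at inference time.

import torch

def adjacent_to_half_strided(x: torch.Tensor) -> torch.Tensor:
    """Map adjacent-pair features (x1, x2), (x3, x4), ... into the half-strided order.

    Even-indexed features go to the first half, odd-indexed features to the second,
    so pair j becomes (y_j, y_{j + d/2}) as in the GPT-NeoX / Llama convention.
    """
    return torch.cat([x[..., 0::2], x[..., 1::2]], dim=-1)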

🔗 Connection

Chapter 8 covers the modern decoder-only LLM stack (Llama, Qwen, Gemma, DeepSeek) where RoPE is the standard position-encoding choice; the layout convention each family uses (Su-original vs GPT-NeoX) is documented there.

8.1 Beyond decoder-only: ModernBERT and 2D RoPE

The encoder-only re-emergence is worth a separate note. ModernBERT explicitly replaces the learned absolute position embedding of the original BERT with RoPE, which is the principal architectural change that lets ModernBERT extend its context window to 8192 tokens or more without re-training the position table [src_002, src_010]. The same architectural ingredient that was developed for autoregressive language models thus carried over cleanly into a bidirectional encoder, illustrating that RoPE's relative-position property is independent of the masking pattern used at attention time.

A second extension, useful for vision Transformers, applies RoPE independently along the two spatial axes of an image: the row index and the column index get their own frequency bands, and the rotations along the two axes are composed into a single rotation per token [src_002]. SigLIP-ViT and similar vision encoders adopt this 2D RoPE to inherit the same length-flexibility and long-term-decay properties as the 1D version. Chapter 5 takes vision Transformers up in detail; the present chapter records only that the same construction generalises.

🔗 Connection

ModernBERT (Chapter 7) ports RoPE into the encoder-only family; SigLIP-ViT and similar vision encoders (Chapter 5) use 2D RoPE per the spatial-axis decomposition sketched here.

9. PyTorch implementation sketch

The actual implementation of RoPE is short, and the operations that look matrix-flavoured in the formal treatment turn into a pair of element-wise products against precomputed cosine and sine tables [src_017, src_010]. The naïve dense \(R^{d}_{\Theta, m} x\) multiplication is \(O(d^2)\), but the block-diagonal sparsity reduces it to \(O(d)\), and a vectorised kernel computes the rotation in two element-wise multiplies and one element-wise add per token-feature [src_017].

The listing below uses the half-strided (GPT-NeoX) convention adopted by Llama and the Raschka reference notebooks [src_010]. Given a query (or key) tensor \(q\) of shape \((B, h, T, d_h)\), the function rotates each per-head pair \((q_{\cdot, \cdot, m, j}, q_{\cdot, \cdot, m, j + d_h/2})\) by the angle \(m \theta_j\). The same routine applies symmetrically to the key tensor.

import torch

def rope_freqs(d_h: int, max_seq_len: int, base: float = 10000.0,
               device: torch.device | str = "cpu") -> tuple[torch.Tensor, torch.Tensor]:
    """Precompute the cos and sin tables for RoPE.

    Returns tensors of shape (max_seq_len, d_h // 2), holding cos(m * theta_j)
    and sin(m * theta_j) for every position m in [0, max_seq_len) and every
    frequency-band index j in [0, d_h // 2).
    """
    # theta_j = base ** (-2 j / d_h), with j = 0, 1, ..., d_h/2 - 1.
    j = torch.arange(0, d_h, 2, device=device, dtype=torch.float32)
    theta = base ** (-j / d_h)                            # shape (d_h // 2,)
    m = torch.arange(max_seq_len, device=device, dtype=torch.float32)
    angles = torch.outer(m, theta)                        # shape (T, d_h // 2)
    return angles.cos(), angles.sin()


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Apply RoPE to a query or key tensor of shape (B, h, T, d_h).

    The half-strided pairing rotates feature j with feature j + d_h/2.
    """
    d_h = x.shape[-1]
    half = d_h // 2
    x1, x2 = x[..., :half], x[..., half:]                 # halves
    cos_t = cos[: x.shape[-2]].unsqueeze(0).unsqueeze(0)  # (1, 1, T, d_h/2)
    sin_t = sin[: x.shape[-2]].unsqueeze(0).unsqueeze(0)
    rotated_1 = x1 * cos_t - x2 * sin_t
    rotated_2 = x1 * sin_t + x2 * cos_t
    return torch.cat([rotated_1, rotated_2], dim=-1)

A few details are worth tracing. The cos and sin tables are precomputed once per model and indexed by absolute position \(m\) at every forward pass; they do not depend on the input tokens and are not learned. The apply_rope routine performs a \(2 \times 2\) rotation per pair \((x_1, x_2)\), in the half-strided convention that Llama adopts [src_010]. The same routine is called with separate inputs and the same tables for the query and the key tensors; the value tensor is left alone, consistent with the relative-position property derived in Section 4 [src_017]. In a complete attention layer, apply_rope would be invoked between the query/key projections and the dot-product softmax computation; the standard scaled dot-product attention from Chapter 1 then proceeds unchanged.
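
A quick self-check (assuming rope_freqs and apply_rope from the listing above are in scope) exercises the relative-position property end to end: rotating every position by the same extra offset leaves the query-key logits unchanged.

B, h, T, d_h, shift = 1, 2, 16, 8, 5
torch.manual_seed(0)
q, k = torch.randn(B, h, T, d_h), torch.randn(B, h, T, d_h)
cos, sin = rope_freqs(d_h, max_seq_len=T + shift)

logits = apply_rope(q, cos, sin) @ apply_rope(k, cos, sin).transpose(-1, -2)
cos_s, sin_s = cos[shift:], sin[shift:]                 # same tokens, every position shifted by 5
logits_shift = apply_rope(q, cos_s, sin_s) @ apply_rope(k, cos_s, sin_s).transpose(-1, -2)
print(torch.allclose(logits, logits_shift, atol=1e-5))  # True: only relative offsets matter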

For readers who want a more elaborate, fully runnable end-to-end RoPE-equipped attention implementation, Raschka's open notebooks accompanying Build a Large Language Model (From Scratch) provide a Llama-style listing that integrates the rotation with multi-head attention and the residual stream [src_010]. The CS336 lecture set covers the same material from the curriculum side and includes a version that exercises the rotation as part of a complete from-scratch language-model training loop [src_004]. The Grigorov engineering text covers PyTorch-level kernel concerns and the CUDA-side implementation details that matter at training scale [src_047].

10. Summary and bridge to Chapter 3

RoPE is, when stripped to its essentials, a two-line idea: rotate the query at position \(m\) by an angle \(m \theta_j\) in each pair of feature dimensions, rotate the key at position \(n\) by \(n \theta_j\), and the inner product picks up only the relative-offset rotation \((n - m) \theta_j\) [src_017]. Everything in this chapter is a consequence of that one observation: the block-diagonal generalisation to \(d\) dimensions, the geometric frequency schedule, the long-term decay property, the variants for context extension, the implementation as element-wise products against precomputed tables, and the empirical fact that essentially every frontier open-weight Transformer in 2024 to 2026 has converged on RoPE rather than on either of its additive predecessors [src_002, src_010, src_017, src_047].

What this chapter has changed in the standard Transformer block is exactly one thing: the position-encoding mechanism. The residual stream, the feed-forward sub-layer, and the layer normalisation are still doing what Chapter 1 said they were doing. Chapter 3 takes up the next two replacements: the layer normalisation is swapped for RMSNorm, and the ReLU-based feed-forward block is swapped for a gated SwiGLU variant. By the end of Chapter 3 we will have a modern decoder block (pre-RMSNorm wrapping a RoPE-augmented attention sub-layer, then pre-RMSNorm wrapping a SwiGLU FFN sub-layer, with residual connections around each) that matches the architecture used by Llama-3, Qwen-3, Gemma-3, and DeepSeek-V3 [src_002, src_010, src_047].

🔄 Recap

  • Complete: what is the rotation angle assigned to position \(m\) in pair \(j\), in terms of \(\theta_j\)?
  • Explain: why is the value pathway untouched by RoPE, and what would change if the value were also rotated?
  • Predict: for \(d_h = 64\), which frequency band carries the longest-range position information, and roughly how many tokens does its wavelength span?
  • Compare: in one sentence each, what distinguishes the original Su layout from the GPT-NeoX half-strided layout?

References