A trick hiding in plain sight
In 2021 a small idea slipped into the transformer with almost no fanfare. A paper called RoFormer proposed encoding a token’s position not by adding something to it, but by rotating it. The idea, Rotary Position Embedding (RoPE), spread quickly. Within two years it had become the default in nearly every serious open model: GPT-NeoX, PaLM, LLaMA and its descendants, Mistral, Qwen, DeepSeek, Gemma. If you have used a modern LLM, you have used RoPE.
It is also the quiet reason your model can sometimes read a whole book. Almost any conversation about long context, whether that means 128K windows, million-token prompts, or the needle-in-a-haystack test, eventually runs into RoPE. RoPE is the thing you have to stretch to make long context work, and the thing that breaks when you stretch it wrong.
This is where to start. Before we can talk about how models reach for longer and longer context, the subject of the posts that follow this one, we need to actually understand the small, elegant trick at the heart of it. By the end of this post, you should be able to look at this formula
$$\big(R(m\theta)\,q\big)^{\!\top}\big(R(n\theta)\,k\big) \;=\; q^{\top} R\!\big((n-m)\theta\big)\, k$$
and find it obvious.
We will keep one example in hand the whole way: the sentence “the dog chased the cat,” and in particular how the word chased should relate to the word dog.
The problem: attention sees an unordered bag
The attention mechanism has an uncomfortable property. By itself, it has no idea what order the words came in.
Attention works by comparing every token with every other token and taking weighted sums, and a sum does not care about order. Shuffle the inputs and each token gets exactly the output it had before; the outputs are merely shuffled the same way. To raw attention, “the dog chased the cat” and “the cat chased the dog” are the same bag of vectors. One sentence has the dog doing the chasing and the other has it being chased, and attention cannot tell them apart.
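The order-blindness is easy to check numerically. The sketch below is a toy, not a real attention layer: it assumes identity query, key and value projections, and the embeddings are random made-up vectors. It runs the two sentences through the same position-free attention and compares what each word receives.

```python
import math
import random

def self_attention(xs):
    """Single-head self-attention with no positional signal.

    Identity Q/K/V projections are an assumption made purely for brevity.
    """
    d = len(xs[0])
    outputs = []
    for q in xs:
        # Scaled dot product of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        # Softmax-weighted sum of the values.
        outputs.append([sum(w * v[j] for w, v in zip(weights, xs)) / total
                        for j in range(d)])
    return outputs

random.seed(0)
original = ["the", "dog", "chased", "the", "cat"]
swapped = ["the", "cat", "chased", "the", "dog"]
# Hypothetical toy embeddings: one random 4-d vector per distinct word.
embed = {w: [random.gauss(0.0, 1.0) for _ in range(4)]
         for w in dict.fromkeys(original)}

out_original = self_attention([embed[w] for w in original])
out_swapped = self_attention([embed[w] for w in swapped])

def close(u, v):
    return all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(u, v))

# "chased" is at index 2 in both sentences and gets the same output either way.
chased_unchanged = close(out_original[2], out_swapped[2])
# "dog" moved from index 1 to index 4, and its output simply moved with it.
dog_moved_with_it = close(out_original[1], out_swapped[4])
print(chased_unchanged, dog_moved_with_it)
```

Both checks come out true: swapping dog and cat changes nothing about what chased computes, which is exactly the problem.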
That is a problem, because meaning lives in order. We have to inject position somehow, giving the model a way to know that in our sentence, chased sits one step after dog, and that this adjacency is part of what the sentence means.
The obvious fix, and its hidden flaw
The original 2017 Transformer solved this the obvious way. It built a position vector out of sines and cosines and added it onto each word’s embedding, like stamping a timestamp onto a letter before you mail it. The token at position 0 gets stamp #0, position 1 gets stamp #1, and so on. The network is then left to untangle “content plus stamp” back into “content” and “where.”
It helps to see those sines and cosines directly. Each row below is one dimension oscillating at its own frequency, fast at the top and slow at the bottom. Move the slider, and the column the cursor marks out is the stamp that gets added to the word sitting at that position:
Warm cells are positive, cool cells negative. The column of cells stacked at the cursor is the position's encoding vector, the "stamp" added to whatever word sits there.
Two things stand out. The stamp depends only on the position, never on the word underneath it. And neighbouring positions get very similar stamps, since the waves move smoothly. Hold that picture, because it is what rotation is about to improve on.
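For readers who prefer code to pictures, here is a minimal sketch of such a stamp, following the interleaved sine and cosine layout of the 2017 paper. The dimension `d=8` and the sample embedding are illustrative assumptions, and `sinusoidal_stamp` is a name made up for this post.

```python
import math

def sinusoidal_stamp(pos, d=8, base=10000.0):
    """Position vector in the style of the 2017 Transformer: interleaved sin/cos."""
    stamp = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)  # lower i oscillates faster
        stamp += [math.sin(angle), math.cos(angle)]
    return stamp

def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical word embedding; the stamp is added to it element by element.
word = [0.3, -1.2, 0.7, 0.1, -0.5, 0.9, 0.0, 0.4]
stamped = [w + s for w, s in zip(word, sinusoidal_stamp(5))]

# Neighbouring positions get closer stamps than distant ones.
near = distance(sinusoidal_stamp(5), sinusoidal_stamp(6))
far = distance(sinusoidal_stamp(5), sinusoidal_stamp(50))
print(near < far)
```

Note that `sinusoidal_stamp` never looks at the word: the same stamp lands on whatever token happens to sit at that position.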
It works, but it carries two flaws that, once you see them, motivate everything RoPE does.
Flaw one: it smears content and position together. Adding a vector moves the point. A word’s embedding encodes its meaning; the stamp we add shifts that vector somewhere new, so the same word at two different positions becomes two genuinely different vectors, with different length and direction. Meaning and position now sit tangled in the same numbers, and the model has to spend capacity pulling them back apart.
Flaw two: it is absolute, but attention wants relative. The stamp records where a token sits counting from the start of the sequence. That is almost never what matters. What matters for chased is that dog is one token back, not that dog happens to be the second word in this particular sentence. Prepend the word “Yesterday,” to our sentence and every absolute position shifts by one, yet the relationship between chased and dog has not changed at all. Absolute encodings force the model to learn how to turn “position 2 versus position 3” into “one step apart,” and to relearn that for every pair of positions. It is work we should not have to do.
Keep both flaws in mind. RoPE fixes them at once, with a single geometric move.
The insight: position is not a number you add, it’s a rotation you apply
There is another way to think about it. Instead of adding a position vector, what if we rotated the token’s vector by an angle that grows with its position?
Take the token’s vector and chop it into pairs of coordinates. Each pair is just a point on a plane, an arrow from the origin. To encode position $m$, spin that arrow by an angle $m\theta$: position 0 gets no turn, position 1 turns by $\theta$, and position $m$ turns by $m\theta$. In two dimensions this is exactly the rotation matrix from high-school geometry:
$$\begin{bmatrix}x'\\ y'\end{bmatrix} =\underbrace{\begin{bmatrix}\cos m\theta & -\sin m\theta\\[2pt] \sin m\theta & \;\;\cos m\theta\end{bmatrix}}_{R(m\theta)} \begin{bmatrix}x\\ y\end{bmatrix}$$
Drag the slider below and watch what happens to a single pair as its position climbs:
Watch the length chip: it stays pinned at 1.00 no matter how far you spin. That is rotation's superpower: position changes the direction, never the magnitude.
Notice the one thing that never changes: the arrow’s length. A rotation changes direction, never magnitude, and that is what makes it a rotation. That single property already takes care of the first flaw. A word’s meaning lives in the length and shape of its vector, and spinning it leaves all of that alone. Position ends up written purely into the angle, kept separate from content.
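The same observation in a few lines of code. This is a sketch of the 2-D rotation only; the step `theta = 0.3` and the starting pair are arbitrary choices for illustration.

```python
import math

def rotate_pair(x, y, position, theta):
    """Apply R(position * theta) to the point (x, y)."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return c * x - s * y, s * x + c * y

theta = 0.3        # hypothetical rotation step per position
x, y = 0.6, 0.8    # a pair of length 1

# Length of the rotated pair at positions 0 through 49.
lengths = [math.hypot(*rotate_pair(x, y, m, theta)) for m in range(50)]
print(min(lengths), max(lengths))
```

The direction changes at every position, while the smallest and largest lengths agree up to floating-point rounding.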
Length preservation is only the warm-up, though. The real payoff is what rotation does to the dot product, and to see it we need one fact about how attention scores tokens.
Why a dot product only feels the angle between
Attention decides how much chased should attend to dog by taking the dot product of chased’s query vector with dog’s key vector. A big dot product means strong attention.
The dot product has a clean geometric meaning. For any two vectors $q$ and $k$,
$$q^{\top} k \;=\; \|q\|\,\|k\|\,\cos\phi,$$
where $\phi$ is the angle between them. The lengths $\|q\|$ and $\|k\|$ are fixed properties of the two words, a kind of loudness. Everything about how the two words relate is carried by that single $\cos\phi$ term. Vectors pointing the same way score high ($\cos 0 = 1$), perpendicular vectors score zero, and opposed vectors score negative.
So an attention score is really a question about the angle between two arrows. That is the sentence to hold onto. If position is an angle, and attention only responds to angles, then position and attention are speaking the same language.
The magic: relative position, for free
Now put the two halves together. Rotate chased’s query by its position $m\theta$ and dog’s key by its position $n\theta$. What is the angle between them afterwards?
Rotating one vector by $m\theta$ and comparing it against another rotated by $n\theta$ composes into a single rotation by the difference. Writing it out with the matrices, and using the identity $R(a)^{\top}R(b) = R(b-a)$, the rotated dot product collapses:
$$\big(R(m\theta)\,q\big)^{\!\top}\big(R(n\theta)\,k\big) \;=\; q^{\top}\,R(m\theta)^{\top}R(n\theta)\,k \;=\; q^{\top}\,R\!\big((n-m)\theta\big)\,k.$$
Look at the right-hand side. The two absolute positions $m$ and $n$ are gone. Only their difference $n-m$ remains. The attention score between chased and dog depends only on how far apart they are, not on where the pair happens to sit in the sentence.
That takes care of the second flaw. We never asked the model to learn how to convert absolute positions into relative ones; the geometry does it on its own. Relative position falls out of the structure for nothing, with zero parameters spent.
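For anyone who wants the identity $R(a)^{\top}R(b) = R(b-a)$ spelled out, it follows from multiplying the two matrices and applying the angle-difference formulas for sine and cosine. Nothing beyond standard trigonometry is assumed:

$$R(a)^{\top}R(b) = \begin{bmatrix}\cos a & \sin a\\ -\sin a & \cos a\end{bmatrix}\begin{bmatrix}\cos b & -\sin b\\ \sin b & \cos b\end{bmatrix} = \begin{bmatrix}\cos a\cos b + \sin a\sin b & \sin a\cos b - \cos a\sin b\\ \cos a\sin b - \sin a\cos b & \sin a\sin b + \cos a\cos b\end{bmatrix} = \begin{bmatrix}\cos(b-a) & -\sin(b-a)\\ \sin(b-a) & \cos(b-a)\end{bmatrix} = R(b-a).$$

Setting $a = m\theta$ and $b = n\theta$ gives $R\big((n-m)\theta\big)$, the single rotation by the difference.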
Try it. Move the query and key positions on their own, then press “shift both +1,” which is the same as prepending “Yesterday,” to the sentence. Both arrows spin, but the offset between them, and the score, stay put:
The score chip is glued to m − n. Shifting both positions sends the arrows spinning, yet the number never flinches. That is relative position, for free, with nothing learned.
This is the payoff that made RoPE win. Shift the whole sentence and chased’s relationship to dog survives intact, because that relationship was stored as the angle between the two vectors, and shifting both simply rotates them together. One thing the demo quietly does, though, is hold the two words’ content fixed so the positional effect stands on its own. In a real attention head the content match is the dominant, learned signal, and rotation only modulates it. The next section puts content back in.
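The shift experiment can also be run without the widget. The sketch below uses a single coordinate pair with made-up query and key values and an arbitrary `theta = 0.5`, and holds content fixed in the same way.

```python
import math

def rotate_pair(vec, position, theta):
    """Rotate a 2-D pair by position * theta."""
    c, s = math.cos(position * theta), math.sin(position * theta)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def score(q, m, k, n, theta=0.5):
    """Dot product of the query rotated to position m and the key rotated to n."""
    rq, rk = rotate_pair(q, m, theta), rotate_pair(k, n, theta)
    return rq[0] * rk[0] + rq[1] * rk[1]

q_chased = (0.9, 0.2)  # hypothetical query pair for "chased"
k_dog = (0.7, 0.6)     # hypothetical key pair for "dog"

before = score(q_chased, 2, k_dog, 1)           # "the dog chased the cat"
after = score(q_chased, 3, k_dog, 2)            # one word prepended
far_along = score(q_chased, 1002, k_dog, 1001)  # same offset, deep in a document
print(before, after, far_along)
```

All three scores match up to floating-point rounding, because the offset between the two positions is 1 in every case.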
How rotation fits inside attention
It is easy to leave that demo thinking the rotation is the whole story. It is not. To see why, it helps to look at where RoPE actually sits inside an attention head, and at how much it leaves alone.
RoPE rotates only the query and key. The value vector, the softmax, and the weighted sum that follows are exactly as they were. Position is a small insertion into the score, not a new attention mechanism.
Attention scores a query against a key with a dot product, and before any position is involved that dot product is pure content matching. Think of it as a tiny search engine running in every layer. Each token sends out a query, a search for what it wants (“I am a verb, I am looking for my subject”). Every other token offers a key, an advertisement for what it is (“I am a noun, I could be a subject”). The dot product scores how well the advertisement answers the search. All of this is learned, by the projection matrices $W^Q$ and $W^K$, and it is the main event. It is what lets chased know it wants nouns and not commas.
RoPE does not touch any of that. It rotates the query and key after they are built, and because a rotation changes direction but not length, the learned content survives untouched. For a single pair of dimensions the score works out to
$$\text{score} \;\approx\; \underbrace{\|q\|\,\|k\|}_{\text{how strong}}\;\cos\big(\,\underbrace{\alpha}_{\text{content}} \,+\, \underbrace{(m-n)\,\theta}_{\text{position}}\,\big),$$
where $\alpha$ is the signed angle from the key’s content direction to the query’s, measured before any rotation is applied. Read the cosine as taking in two things at once. The content angle $\alpha$ is small when the words genuinely match and large when they do not. The positional term $(m-n)\theta$ is a fixed turn that depends only on how far apart they are. The score is highest when the content matches and the relative distance is one the head cares about. A perfect content match at an unwanted distance gets pulled down, and a favoured distance with mismatched content still scores low. Both have to agree.
Our running example needs both halves. Content alone tells chased to look at nouns, but dog and cat are equally nouns, so content cannot say which is the subject. Position alone would prefer whatever sits one step back, but it cannot tell a noun from a comma. Put them together and chased attends to dog because dog is both a content match and at the relative offset the head has learned to read as a subject. RoPE did not pick dog. Content narrowed the field to nouns, and rotation broke the tie by distance.
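A small numeric check of the “both have to agree” reading, for a single coordinate pair. It stores each pair as a complex number so that rotation becomes multiplication by $e^{i\,\text{angle}}$. The query and key values, `theta = 0.4`, and the convention that $\alpha$ is measured from the key’s direction to the query’s are all assumptions chosen for illustration.

```python
import cmath
import math

def pair_score(q, m, k, n, theta):
    """Dot product of one rotated query pair with one rotated key pair."""
    rq = q * cmath.exp(1j * m * theta)
    rk = k * cmath.exp(1j * n * theta)
    return (rq * rk.conjugate()).real  # real part equals the 2-D dot product

def formula(q, m, k, n, theta):
    """The closed form: |q| |k| cos(alpha + (m - n) * theta)."""
    alpha = cmath.phase(q) - cmath.phase(k)  # signed content angle (assumed convention)
    return abs(q) * abs(k) * math.cos(alpha + (m - n) * theta)

theta = 0.4
q_chased = complex(1.0, 0.3)   # hypothetical query for "chased"
k_dog = complex(0.9, 0.5)      # hypothetical key: good content match
k_comma = complex(-0.8, 0.1)   # hypothetical key: poor content match

dog_one_back = pair_score(q_chased, 2, k_dog, 1, theta)
comma_one_back = pair_score(q_chased, 2, k_comma, 1, theta)
dog_eight_back = pair_score(q_chased, 9, k_dog, 1, theta)
print(dog_one_back, comma_one_back, dog_eight_back)
```

With these particular numbers only the first case scores high: a good content match one step back. The poor match at the same offset and the good match at a different offset both score lower. In a real head the preferred offsets are learned and spread across many pairs, so treat this as a cartoon of the mechanism.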
And everything downstream is left alone. The value vectors are never rotated, the softmax is the same softmax, and the weighted sum that produces the head’s output is unchanged. RoPE is a small, surgical edit to one quantity, the query-key score, not a new attention mechanism.
One frequency isn’t enough: a clock with many hands
A single rotation speed has a catch, because a circle wraps around. If every pair spun at the same rate $\theta$, two positions a full turn apart would land in the same place, and you could not tell them apart. One spinning hand cannot tell you the time on its own.
A clock fixes this with several hands at different speeds. The second hand resolves fine detail, the hour hand tracks the long sweep, and together they pin down a single moment. RoPE does the same thing, giving each coordinate pair $i$ its own rotation speed:
$$\theta_i = b^{-2i/d}, \qquad b = 10000, \qquad i = 0, 1, \dots, \tfrac{d}{2}-1.$$
The first pairs spin fast, so they resolve fine, local distances like “one token apart.” The last pairs spin slowly, tracking coarse, long-range position across thousands of tokens. Stack them all and every position gets a unique multi-frequency fingerprint, with no wraparound ambiguity across the range that matters.
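To make the clock analogy concrete, the sketch below compares one hand against many. The single speed of an eighth of a turn per token and the tiny head size `d = 8` are arbitrary choices, and `fingerprint` is a helper name invented here.

```python
import math

def rope_frequencies(d, base=10000.0):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

def fingerprint(position, thetas):
    """(cos, sin) of every pair's angle at this position."""
    return [(math.cos(position * t), math.sin(position * t)) for t in thetas]

def same_fingerprint(a, b, tol=1e-9):
    return all(abs(ca - cb) < tol and abs(sa - sb) < tol
               for (ca, sa), (cb, sb) in zip(a, b))

# One hand: a single (hypothetical) speed of an eighth of a turn per token.
one_hand = [2 * math.pi / 8]
aliased = same_fingerprint(fingerprint(3, one_hand), fingerprint(11, one_hand))

# Many hands: the RoPE schedule for a small d = 8 head.
thetas = rope_frequencies(8)
distinct = not same_fingerprint(fingerprint(3, thetas), fingerprint(11, thetas))
print(thetas, aliased, distinct)
```

With one hand, positions 3 and 11 are a full turn apart and look identical. With the four speeds of the RoPE schedule they no longer do.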
Watch the bank of dials below, and drag slowly: the leftmost dial races around while the rightmost barely budges. (For clarity the visual uses a gentler base than the real 10000, but the principle is identical.)
The full RoPE rotation is just this whole set of 2-D rotations stacked into one large block-diagonal matrix, with each pair of dimensions spun at its own frequency. Conceptually it is the picture above, repeated $d/2$ times.
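In code, nobody builds the block-diagonal matrix explicitly; each pair is rotated directly. The following is a minimal sketch, not any library’s implementation. It pairs adjacent dimensions $(0,1), (2,3), \dots$, which is one convention; some implementations pair dimension $i$ with $i + d/2$ instead. The vectors are made up.

```python
import math

def apply_rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of vec by position * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta_i = base ** (-2 * i / d)
        c, s = math.cos(position * theta_i), math.sin(position * theta_i)
        x, y = vec[2 * i], vec[2 * i + 1]  # adjacent dimensions form a pair
        out += [c * x - s * y, s * x + c * y]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 8-dimensional query and key.
q = [0.5, -1.0, 0.3, 0.8, -0.2, 0.1, 0.9, -0.4]
k = [0.4, -0.7, 0.6, 0.2, 0.1, -0.3, 0.5, 0.5]

score_a = dot(apply_rope(q, 2), apply_rope(k, 1))
score_b = dot(apply_rope(q, 57), apply_rope(k, 56))
print(score_a, score_b)
```

The two scores agree up to rounding: the relative-position property of a single pair carries over to the full vector because each pair contributes its own term to the dot product.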
Two things are worth noticing now, because they come back when we talk about long context. The base $b = 10000$ is a knob you can turn. And the fast pairs are the ones that wrap around soonest.
A free locality prior: nearby leans in, distant fades
The many frequencies do one more thing, almost by accident. When you add up the cosines across all the pairs, they reinforce at zero distance and start to interfere as the distance grows. The result is that the raw attention score between two identical vectors is high when they sit close together and decays, with a gentle ripple, as they move apart.
$$\text{score}(\Delta)=\frac{1}{d/2}\sum_{i} \cos\big(\Delta\,\theta_i\big), \qquad \Delta = m - n.$$
Slide the dimension count and watch the curve: more frequencies, smoother decay, sharper peak at $\Delta = 0$.
Peaks at Δ = 0, then settles toward zero for far-apart tokens. A soft "pay more attention to what's near" prior, with no parameters spent.
So RoPE quietly hands the model a sensible default, namely “pay more attention to what is near,” without spending a single parameter on it. The model can override that prior when it needs to reach far, but it starts from a reasonable place.
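The decay can be reproduced directly from the formula. This sketch assumes a head dimension of `d = 64` and the standard base, and averages over small windows of distances to smooth out the ripple.

```python
import math

def locality_score(delta, d=64, base=10000.0):
    """Average of cos(delta * theta_i) over all d/2 pairs."""
    thetas = [base ** (-2 * i / d) for i in range(d // 2)]
    return sum(math.cos(delta * t) for t in thetas) / (d // 2)

at_zero = locality_score(0)
# Average over nearby offsets (1..10) and over distant ones (1000..1099).
near = sum(locality_score(delta) for delta in range(1, 11)) / 10
far = sum(locality_score(delta) for delta in range(1000, 1100)) / 100
print(at_zero, near, far)
```

The score is exactly 1 at zero distance, stays high for the nearby window, and is noticeably lower for the distant one. The curve is not strictly monotonic, which is the “gentle ripple” mentioned above.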
Adding moves the point; rotating keeps it honest
We can now see, side by side, why rotation beats addition. Sinusoidal encoding adds a position vector, so the point drifts off its circle: its length changes, and content gets tangled up with position. RoPE rotates instead, so the point glides along its circle, its length perfectly preserved and position kept separate from meaning.
[Interactive figure, two panels. Left: add (sinusoidal), $E + PE(m)$. Right: rotate (RoPE), $R(m\theta)\,E$.]
Sweep the position. On the left the length wobbles, which is the word’s meaning being disturbed as it moves through the sentence. On the right it stays rock-steady. Same goal of encoding position, very different treatment of the content.
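The comparison is simple to reproduce. The sketch below takes one made-up 4-dimensional embedding `E`, applies each scheme at positions 0 through 31, and measures how much the vector’s length moves. Both helpers are illustrative reimplementations, not code from any library.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def add_sinusoidal(vec, pos, base=10000.0):
    """Additive scheme: E + PE(pos)."""
    d = len(vec)
    out = list(vec)
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        out[2 * i] += math.sin(angle)
        out[2 * i + 1] += math.cos(angle)
    return out

def rotate_rope(vec, pos, base=10000.0):
    """Rotary scheme: rotate each adjacent pair by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [c * x - s * y, s * x + c * y]
    return out

E = [1.0, 0.5, -0.3, 0.8]  # hypothetical word embedding
added = [norm(add_sinusoidal(E, m)) for m in range(32)]
rotated = [norm(rotate_rope(E, m)) for m in range(32)]

add_spread = max(added) - min(added)
rot_spread = max(rotated) - min(rotated)
print(add_spread, rot_spread)
```

The additive lengths spread over a visible range, while the rotated lengths differ only by floating-point rounding.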
Why it won
Pull back, and the list of advantages is long for something so simple:
- Relative position, for free. The attention dot product depends only on $m-n$. The model never has to learn to subtract positions, because it is guaranteed by construction.
- Meaning stays intact. Rotation preserves length, so a token’s content is not corrupted by where it sits, unlike additive encodings, which blur the two together.
- Applied where it matters. RoPE rotates the queries and keys inside every attention layer, right where the comparison happens, instead of being bolted once onto the input embedding and left to fade.
- Zero extra parameters. It is a fixed geometric operation. There is nothing to train, almost nothing to compute, and it composes cleanly with efficient-attention kernels like FlashAttention.
- A built-in locality prior. Scores naturally taper with distance, a free and sensible default.
- It stretches. Because it encodes relative distance through smooth, tunable frequencies, RoPE can be rescaled to longer sequences far more gracefully than anything before it, which is the whole reason it underpins modern long-context models.
That last point is where this post ends and the next one begins.
The bridge: from rotation to long context
This last point is also where the trouble starts. The same frequency structure that makes RoPE so elegant also limits how far past its training length a model can read.
A model trained with a context window of, say, 4K tokens has only ever seen rotation angles up to $4096 \cdot \theta_i$ for each pair. The fast pairs will have swept through their whole range many times inside those 4K tokens, while the slow pairs have turned only a fraction of a circle. The network has learned to read positions inside that envelope of angles, and nowhere else.
Now feed it a 100K-token prompt at inference. The fast pairs have already seen every angle on their circle, but the slow pairs are suddenly pushed to phase angles the model never saw in training. As far as the network is concerned, those positions have gone out of distribution. Attention destabilizes, and quality falls off a cliff long before the prompt ends.
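A quick count of full turns makes the envelope tangible. The head dimension `d = 128` is an assumption (a common size, but not stated in this post); the base is the standard 10000.

```python
import math

def turns(position, theta):
    """Number of full turns a pair with speed theta has made by this position."""
    return position * theta / (2 * math.pi)

d, base = 128, 10000.0  # assumed head dimension and the standard base
fastest = base ** (-2 * 0 / d)             # theta_0 = 1
slowest = base ** (-2 * (d // 2 - 1) / d)  # theta_{d/2 - 1}

for length in (4096, 100_000):
    print(length,
          round(turns(length, fastest), 1),
          round(turns(length, slowest), 3))
```

Under these assumptions the fastest pair has gone around hundreds of times within 4K tokens, so every angle on its circle is familiar. The slowest pair covers less than a tenth of a turn in training and almost two full turns at 100K, so most of the angles it reaches there are new.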
This is why context extension has become its own discipline. Every major technique is, underneath, a way of manipulating the angles and frequencies we just built up. Position Interpolation squeezes the positions back into the trained range. NTK-aware scaling turns that base-frequency knob $b$ up, which stretches the slow pairs the most and leaves the fastest pairs almost untouched. YaRN interpolates each frequency band differently. None of them make much sense until you can see position as a rotation, which you now can.
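As a preview, the first two techniques fit in a few lines. This is a rough sketch under stated assumptions. The sizes are hypothetical. `ntk_scaled_base` uses one commonly quoted form of the NTK-aware rule, $b' = b \cdot s^{d/(d-2)}$, and the exponent should be treated as an assumption, not a definitive recipe. YaRN is left out because it needs per-band logic.

```python
import math

def rope_frequencies(d, base=10000.0):
    return [base ** (-2 * i / d) for i in range(d // 2)]

def interpolate_position(position, scale):
    """Position Interpolation: shrink positions so they fit the trained range."""
    return position / scale

def ntk_scaled_base(base, scale, d):
    """One commonly quoted NTK-aware rule (assumed here): base * scale^(d / (d - 2))."""
    return base * scale ** (d / (d - 2))

d, trained_len, target_len = 128, 4096, 32768  # hypothetical sizes
scale = target_len / trained_len               # 8.0

original = rope_frequencies(d)
rescaled = rope_frequencies(d, ntk_scaled_base(10000.0, scale, d))

pi_max_position = interpolate_position(target_len, scale)
fast_ratio = rescaled[0] / original[0]    # change in the fastest pair's speed
slow_ratio = rescaled[-1] / original[-1]  # change in the slowest pair's speed
print(pi_max_position, fast_ratio, slow_ratio)
```

Position Interpolation maps the last target position back onto the last trained one. With this choice of exponent, raising the base leaves the fastest pair’s speed unchanged and divides the slowest pair’s speed by exactly the scale factor, with the pairs in between landing somewhere between those extremes.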
That is the subject of the next post in this series.
RoPE also shows up in surprising places elsewhere on this blog. The evolution of attention post shows how DeepSeek’s Multi-head Latent Attention has to do some delicate surgery, a decoupled form of RoPE, to stay compatible with rotary embeddings while compressing the KV cache. The state-space models post shows Mamba-3 reusing the same rotary machinery with data-dependent angles. The idea travels a long way.
Lessons for builders
A few takeaways that generalize beyond RoPE:
- Positional information wants to be relative. When you catch yourself making a model re-derive the same relationship at every absolute offset, look for a representation where that relationship is built in rather than learned.
- Magnitude-preserving operations keep signals clean. Rotation works partly because it refuses to touch the content’s length. When you have to inject one kind of information into a vector that already carries another, prefer transforms that leave the existing signal undisturbed.
- The base frequency is a real knob, not a constant. That $10000$ is not sacred; long-context models routinely raise it to $500{,}000$ or $1{,}000{,}000$ to slow the low-frequency pairs down, so their angles stay in familiar territory over much longer inputs. When a hyperparameter is set “because the paper said so,” it is worth knowing what it actually controls.
- The elegant default and its failure mode are two sides of one coin. The frequencies that give RoPE its clean relative encoding are the same ones that go out of distribution past the training length. The mechanism and its breaking point cannot be pulled apart, so understanding one means understanding the other.
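To put a number on what the base controls, the sketch below computes how many tokens the slowest pair needs to complete one full turn, for the three base values mentioned in the list above. The head dimension `d = 128` is an assumption, and `longest_wavelength` is a helper name invented here.

```python
import math

def longest_wavelength(base, d=128):
    """Tokens needed for the slowest pair to complete one full turn."""
    slowest = base ** (-2 * (d // 2 - 1) / d)
    return 2 * math.pi / slowest

for base in (10_000, 500_000, 1_000_000):
    print(base, round(longest_wavelength(base)))
```

Under that assumption the longest wavelength is in the tens of thousands of tokens at the default base and in the millions at the larger ones, which is the sense in which the base sets the scale of distances the slow pairs can tell apart.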
Conclusion
RoPE comes down to one choice: to encode position, rotate the query and key instead of adding to them. Rotation keeps each vector’s length, so a word’s meaning stays intact, and since rotating both vectors changes only the angle between them, by an amount set by their offset, the score ends up tracking how far apart two tokens are rather than where they sit.
If you want the original source, it is the RoFormer paper by Su et al. It is short and readable, and after this it should be easy to follow. It is also the groundwork for the next question in this series: how a model trained on a few thousand tokens manages to read much longer inputs.