Attention

Rotary Positional Encoding: Why Position Is a Rotation

An intuitive, visual guide to Rotary Positional Encoding. Why spinning the query and key vectors beats stamping a position number onto them, why a dot product only ever feels the angle between two vectors, and why that hands you relative position for free. The starting point for understanding how LLMs stretch to long context.

The Evolution of Attention, Part 1: From MHA to Latent Compression

Part 1 of 2. Every attention variant since 2019 fights the same number: KV cache bytes per token. This post traces the first wave of answers, from MHA through MQA and GQA, to DeepSeek-V2’s Multi-head Latent Attention. We end at the 57× cache reduction that comes from caching a low-rank latent and never materializing K or V at inference.

Flash Attention: The Mathematical Tricks That Broke the Memory Wall

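Flash Attention, an exact, memory-efficient attention algorithm for transformers. It tiles the computation so the full attention matrix is never written out to GPU high-bandwidth memory.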