What This Post Covers

NVIDIA’s Nemotron-3-Super is not a Transformer. Not entirely. It is a hybrid architecture that interleaves Mamba-2 SSM layers with a small number of attention layers, spending the majority of its compute on state space operations rather than self-attention. It ships in production on NVIDIA’s inference stack and competes with pure Transformer models at the same scale. NVIDIA is not alone. IBM’s Granite 4.0 uses a 9:1 ratio of SSM to Transformer layers. AI21’s Jamba interleaves one attention layer for every seven Mamba layers. Zyphra’s Zamba, Google’s Griffin, Microsoft’s Phi-4-mini-flash-reasoning: all hybrid architectures, all in production.

Something shifted. For years, the Transformer was the only architecture that mattered for language. Now every major AI lab is replacing most of their Transformer layers with SSM layers and getting better results at lower inference cost. If you deploy models, manage GPU clusters, or care about inference latency, this is worth understanding deeply.

This post builds State Space Models from zero. I start with the simplest possible differential equation: one variable, one parameter. From there, I build up to the full SSM formulation, explain the key breakthroughs (HiPPO, S4), and walk through the three generations of Mamba. By the end, you will understand the math well enough to explain to your team why these hybrid architectures are winning, what trade-offs they make, and what it means for your inference stack.

Part 1: Why SSMs? The Transformer’s Inference Problem

You already know the Transformer’s self-attention mechanism scales quadratically with sequence length: $O(L^2)$ in both time and memory. But the pain runs deeper than asymptotic notation.

During autoregressive decoding, the Transformer generates one token at a time. For each new token, it must load the entire KV cache from GPU HBM into SRAM, compute attention scores against every previous token, and write the new KV entry back. The GPU spends the vast majority of its time moving data, not computing. On an H100 generating tokens from a 70B model, the Tensor Cores that deliver 989 TFLOPS of dense BF16 matmul sit almost entirely idle during decoding. The bottleneck is memory bandwidth, not compute.

This is why you need PagedAttention to manage fragmented KV cache memory. This is why vLLM exists: to batch requests efficiently despite variable KV cache sizes. This is why context windows beyond 128K tokens start requiring multi-GPU setups just to hold the KV cache.

State Space Models offer a fundamentally different deal. Instead of caching every token’s key-value pair (lossless but expensive), they compress the entire sequence history into a fixed-size hidden state (lossy but cheap). Processing each new token during inference takes $O(1)$ time and memory. No growing KV cache. No PagedAttention. Constant memory per sequence regardless of whether you have processed 100 tokens or 100,000.

The question has always been whether a compressed, lossy state can match the quality of the Transformer’s lossless KV cache. For years, the answer was no. SSMs excelled on audio, time series, and synthetic long-range benchmarks, but they lagged on language. The Mamba line of work changed that. To understand how, we need to start from scratch.

Part 2: State Space Models from Scratch

A Single Differential Equation

Forget matrices, vectors, and neural networks for a moment. Start with a single number $h(t)$ that changes over time:

$$h'(t) = a \cdot h(t)$$

$h'(t)$ is the rate of change of $h$ at time $t$. If you know calculus, this is the derivative. If not, think of it as: “how fast is $h$ changing right now?” When $h'(t)$ is positive, $h$ is increasing. When negative, $h$ is decreasing. When zero, $h$ is holding steady.

The constant $a$ controls everything:

  • $a > 0$: $h$ grows exponentially. Think compound interest. A bank account with interest rate $a = 0.05$ earns interest on its interest, accelerating upward forever. Unstable.
  • $a < 0$: $h$ decays exponentially. Think radioactive decay. A substance with decay rate $a = -0.5$ loses half its remaining mass roughly every 1.4 time units. The more you have, the faster it drains, but it never quite reaches zero. Stable.
  • $a = 0$: nothing happens. $h$ is constant forever.

[Interactive figure: how the parameter $a$ controls state behavior. A slider over $a$ plots $h(t) = h(0) \cdot \exp(a \cdot t)$, from stable decay (e.g. $a = -0.5$) to unstable growth.]

For building sequence models, we want $a < 0$. Our hidden state should be a fading memory, not an explosion.

Adding an Input

A decaying state by itself is useless. We need to feed information in:

$$h'(t) = a \cdot h(t) + b \cdot x(t)$$

Now $x(t)$ is an input signal (think: a stream of token embeddings arriving over time), and $b$ controls how strongly the input drives the state.

Picture a leaky bucket with a tap. The water level $h(t)$ is the state. The hole in the bottom drains water at rate $a \cdot h(t)$: the more water in the bucket, the faster it leaks (more pressure = faster drain). The tap $b \cdot x(t)$ pours water in at a rate proportional to the input signal. The water level at any moment reflects a fading, weighted average of all the water that has ever been poured in, with recent additions contributing more because older ones have partially leaked away.

[Figure: the leaky bucket, the core SSM intuition. The input $b \cdot x(t)$ pours in, the state $h(t)$ is the water level, and the leak $a \cdot h(t)$ drains it. The water level at any moment is a fading, weighted average of all past inputs.]

This is the core intuition for the entire SSM line of work. The hidden state $h(t)$ is a running, compressed summary of the input history, where old inputs fade at a rate controlled by $a$.

Adding an Output

We read out the state with simple scaling:

$$y(t) = c \cdot h(t)$$

The output $y(t)$ is just a weighted view of the state. Together, these two equations form the complete scalar SSM:

$$h'(t) = a \cdot h(t) + b \cdot x(t) \quad \text{(state equation)}$$

$$y(t) = c \cdot h(t) \quad \text{(output equation)}$$

Three parameters. One input, one hidden state, one output. This is the entire architecture, in its simplest form.
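To make the leaky-bucket picture concrete, here is a minimal simulation of the scalar SSM, a sketch that integrates the ODE numerically with a very small time step (toy values of my own choosing: $a = -0.5$, $b = c = 1$):

```python
# Toy scalar SSM: h'(t) = a*h(t) + b*x(t), y(t) = c*h(t)
# (illustrative values; integrated numerically with a small step)
a, b, c = -0.5, 1.0, 1.0
dt = 0.001

h = 0.0
trace = []
for step in range(10_000):             # simulate t in [0, 10)
    x = 1.0 if step < 1000 else 0.0    # open the tap for the first time unit
    h += dt * (a * h + b * x)          # step the ODE forward
    trace.append(c * h)

# The level rises while the tap is open, then leaks away with
# half-life ln(2)/0.5 ≈ 1.4 time units once the input stops.
```

The state climbs while water pours in, then drains exponentially once the input goes silent, exactly the fading memory described above.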

Why One Bucket Is Not Enough

A single leaky bucket has one leak rate, which means one timescale of memory. If $a = -0.5$, the state “forgets” with a half-life of about 1.4 time units. It cannot simultaneously maintain a short-term memory (last few tokens) and a long-term memory (paragraph-level context).

The fix: use $N$ buckets, each with a different leak rate.

[Figure: multiple timescales via multiple state dimensions. A single state dimension has one leak rate, hence one timescale. Generalizing to $N$ dimensions gives each its own decay rate (eigenvalue), e.g. $\lambda_1 = -0.01$ (long memory) down to $\lambda_4 = -2.0$ (short memory).]

This is where scalars become vectors. The scalar state $h(t)$ becomes an $N$-dimensional vector $\mathbf{h}(t) \in \mathbb{R}^N$. The scalar parameters become matrices:

  • $A \in \mathbb{R}^{N \times N}$ (state to state): governs how each of the $N$ state dimensions evolves and potentially interacts with the others. It is $N \times N$ because each state dimension can influence every other state dimension.
  • $B \in \mathbb{R}^{N \times 1}$ (input to state): fans a scalar input out into $N$ state dimensions. It is $N \times 1$ because it needs to distribute one input value across $N$ state slots. Think of it as an adapter between a narrow input pipe and a wide state vector.
  • $C \in \mathbb{R}^{1 \times N}$ (state to output): narrows the wide state back down to a scalar output. It is $1 \times N$ because it takes a weighted combination of all $N$ state dimensions to produce one output value.

$$\mathbf{h}'(t) = A \cdot \mathbf{h}(t) + B \cdot x(t)$$

$$y(t) = C \cdot \mathbf{h}(t)$$

[Figure: the SSM as a pipeline. The scalar input $x(t) \in \mathbb{R}$ is fanned out by $B \in \mathbb{R}^{N \times 1}$ into the state $\mathbf{h}(t) \in \mathbb{R}^N$, which evolves under the dynamics $A \in \mathbb{R}^{N \times N}$ and is narrowed back by $C \in \mathbb{R}^{1 \times N}$ to the scalar output $y(t) \in \mathbb{R}$.]

In practice, $A$ is almost always diagonal. A diagonal $A$ means each state dimension evolves independently. No cross-talk between buckets. Dimension 1 decays at its own rate, dimension 2 at its own rate, and so on. The S4D paper showed empirically that this simplification matches the full structured matrix, and it is much cheaper to compute.

Eigenvalues: The Retention Rates

For a diagonal $A$, the diagonal entries ARE the eigenvalues. No linear algebra required to understand this. Each eigenvalue $\lambda_i$ is simply the decay rate of one state dimension. Think of them as $N$ different bank account interest rates running simultaneously:

  • $\lambda_i = -0.01$: very slow decay. This dimension remembers inputs from thousands of timesteps ago. It is the long-term savings account.
  • $\lambda_i = -0.5$: moderate decay. This dimension tracks information over dozens of timesteps.
  • $\lambda_i = -2.0$: fast decay. This dimension mostly tracks the last few inputs. It is the checking account that turns over quickly.
  • $\lambda_i > 0$: growth. Unstable. The state explodes. We never want this.

By having $N$ state dimensions with different eigenvalues, the model simultaneously maintains memory at multiple timescales. Some dimensions track recent tokens (large negative eigenvalues, fast decay), others preserve long-range context (small negative eigenvalues, slow decay).
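A quick numerical sketch of this multi-timescale behavior, using four illustrative eigenvalues: feed all four dimensions a single impulse, then watch how much of it each retains after 100 silent steps.

```python
import numpy as np

# Four independent state dimensions (diagonal A), toy eigenvalues
eigs = np.array([-0.01, -0.1, -0.5, -2.0])
delta = 0.1
a_bar = np.exp(delta * eigs)       # exact per-dimension decay factor per step

h = np.ones(4)                     # state right after a unit impulse
for _ in range(100):               # 100 steps with no further input
    h = a_bar * h

# The slowest dimension still holds ~0.9 of the impulse; the fastest
# has forgotten it almost entirely (~1e-9).
```

Same input, same number of steps, four very different memories, which is exactly what a language model needs to track both the last few tokens and paragraph-level context.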

This is why the initialization of $A$ matters enormously. If you set eigenvalues randomly, you get random memory timescales, and the model struggles to learn useful representations of sequences. Random eigenvalues might cluster all your memory in one timescale, leaving gaps in others. This is exactly the problem that HiPPO solved. But we need to cover discretization first.

Part 3: Discretization, Making It Computable

Why Discretize

The continuous ODE $\mathbf{h}'(t) = A\mathbf{h}(t) + Bx(t)$ processes smooth, continuous signals. But an LLM does not receive continuous signals. It receives a discrete sequence of tokens, one after another. We need to convert the continuous dynamics into a step-by-step recurrence: given the previous state and the current token, compute the next state.

The Step Size $\Delta$

Discretization introduces a learnable parameter $\Delta$ that controls “how much continuous time passes between tokens.” Small $\Delta$ means the model takes fine-grained steps, preserving detailed temporal structure. Large $\Delta$ means coarse steps, compressing more time into each token. Each channel in the model can learn its own $\Delta$, so different parts of the network can operate at different temporal resolutions.

The Euler Discretization

The simplest approach: approximate the derivative as constant over each timestep, following the Euler method from numerical analysis. This gives us discrete parameters:

$$\bar{A} = I + \Delta A$$

$$\bar{B} = \Delta B$$

This is first-order accurate, with local truncation error $O(\Delta^2)$. There are more accurate methods (zero-order hold, bilinear transform), but Euler is the one that matters for the Mamba story because Mamba-3 directly improves on it.

The Discrete Recurrence

The discretized system gives us a step-by-step formula. At each timestep $k$:

$$\mathbf{h}_k = \bar{A} \cdot \mathbf{h}_{k-1} + \bar{B} \cdot x_k$$

$$y_k = C \cdot \mathbf{h}_k$$

This is now a simple step-by-step formula. Given the previous state and the current input, compute the next state and output. No calculus needed at runtime.

Numerical Walkthrough

Let me make this concrete. Take a scalar system with $a = -0.5$, $b = 1.0$, $c = 1.0$, and step size $\Delta = 0.1$.

$$\bar{a} = 1 + (-0.5)(0.1) = 0.95 \qquad \bar{b} = 0.1$$

Now run 5 timesteps with the input sequence $[1, 1, 0, 0, 1]$, starting from $h_0 = 0$:

# a_bar = 0.95, b_bar = 0.1

# k=0: input=1  h = 0.95 * 0      + 0.1 * 1 = 0.1000   y = 0.1000
# k=1: input=1  h = 0.95 * 0.1    + 0.1 * 1 = 0.1950   y = 0.1950
# k=2: input=0  h = 0.95 * 0.195  + 0.1 * 0 = 0.1853   y = 0.1853
# k=3: input=0  h = 0.95 * 0.1853 + 0.1 * 0 = 0.1760   y = 0.1760
# k=4: input=1  h = 0.95 * 0.176  + 0.1 * 1 = 0.2672   y = 0.2672

The state accumulates when inputs arrive (steps 0-1, step 4) and decays when they stop (steps 2-3). You can verify every number with a calculator. There is nothing hidden in the SSM recurrence: it is a multiply-and-add, repeated.
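The trace above can be reproduced in a few lines, using the same toy values ($\bar{a} = 0.95$, $\bar{b} = 0.1$, $c = 1$):

```python
# Reproduce the walkthrough: h values 0.1000, 0.1950, 0.1853, 0.1760, 0.2672
a_bar, b_bar, c = 0.95, 0.1, 1.0
h = 0.0
for k, x in enumerate([1, 1, 0, 0, 1]):
    h = a_bar * h + b_bar * x      # the entire SSM recurrence: multiply-and-add
    y = c * h
    print(f"k={k}: input={x}  h={h:.4f}  y={y:.4f}")
```

One multiply, one add, one scale per step. That really is the whole recurrence.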

The Dual Computation Modes

This is the SSM’s defining superpower. In the formulation above, $A$, $B$, and $C$ are constants that do not change over time. This property is called Linear Time-Invariance (LTI), and it unlocks something powerful. Because the recurrence $\mathbf{h}_k = \bar{A}\mathbf{h}_{k-1} + \bar{B}x_k$ is linear with fixed parameters, we can unroll it algebraically:

$$\mathbf{h}_k = \bar{A}^k \bar{B} x_0 + \bar{A}^{k-1}\bar{B}x_1 + \cdots + \bar{B}x_k$$

The output $y_k = C\mathbf{h}_k$ is then a weighted sum of all past inputs, with weights $K = (C\bar{B},\ C\bar{A}\bar{B},\ C\bar{A}^2\bar{B},\ \ldots)$. This sequence of weights is a convolution kernel.

This means we can compute the output of the SSM in two completely different ways:

Training mode (convolution): Compute the kernel $K$ once, then convolve it with the entire input sequence via FFT in $O(L \log L)$. Fully parallel, like a CNN. The GPU processes all $L$ tokens simultaneously.

Inference mode (recurrence): Step through $\mathbf{h}_k = \bar{A}\mathbf{h}_{k-1} + \bar{B}x_k$ one token at a time in $O(1)$ per step. The state $\mathbf{h}$ is a fixed-size vector regardless of how many tokens have been processed. No KV cache.
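The duality is easy to verify numerically. A sketch with the same toy scalar system as before: compute the outputs once by stepping the recurrence and once by convolving with the kernel $K$, and check they agree.

```python
import numpy as np

a_bar, b_bar, c = 0.95, 0.1, 1.0           # toy LTI system
x = np.array([1.0, 1.0, 0.0, 0.0, 1.0])
L = len(x)

# Inference mode: step the recurrence, O(1) state per token
h, y_rec = 0.0, []
for xk in x:
    h = a_bar * h + b_bar * xk
    y_rec.append(c * h)

# Training mode: build the kernel K = (c*b, c*a*b, c*a^2*b, ...) and convolve
K = c * b_bar * a_bar ** np.arange(L)
y_conv = np.convolve(x, K)[:L]             # causal convolution, first L outputs

assert np.allclose(y_rec, y_conv)          # identical outputs, two algorithms
```

Here the convolution is computed directly for clarity; S4 computes it via FFT to get the $O(L \log L)$ training cost.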

[Figure: the dual computation modes, one model computed two ways. Training mode runs a parallel convolution, $y = x * \bar{K}$ via FFT in $O(L \log L)$, processing all tokens simultaneously. Inference mode runs the sequential recurrence $\mathbf{h}_t = \bar{A}\mathbf{h}_{t-1} + \bar{B}x_t$ in $O(1)$ per token, with a fixed-size state and no KV cache.]

“Train like a CNN, infer like an RNN.” This is the fundamental efficiency proposition. During training, you get the parallelism of convolutions. During inference, you get the constant-time, constant-memory decoding of RNNs, without the KV cache that makes Transformer inference expensive.

This duality is only possible because the system is LTI: the parameters $A$, $B$, $C$ are fixed, so the same convolution kernel $K$ applies to every input. When parameters become input-dependent (which is what Mamba does), there is no single kernel for the whole sequence. The duality breaks, and new algorithms are needed.

Part 4: HiPPO, The Initialization That Made SSMs Work

Before HiPPO, SSMs initialized the state matrix $A$ randomly. Random eigenvalues produce random memory timescales. On Sequential MNIST (classifying a handwritten digit fed one pixel at a time, 784 steps), random initialization achieved about 60% accuracy, far short of what the architecture turned out to be capable of.

Albert Gu’s HiPPO framework (2020) solved this by deriving $A$ matrices from a mathematical objective: at every timestep, the state should store the best polynomial approximation of the entire input history. Each state dimension corresponds to one polynomial coefficient, with low-order coefficients capturing broad trends (long-range memory) and high-order coefficients capturing fine details (short-range memory). The resulting $A$ matrix has eigenvalues arranged to cover multiple timescales without redundancy.

The concrete impact: switching from random $A$ to HiPPO improved Sequential MNIST from 60% to 98%. Same architecture, same training. Only the initialization of $A$ changed.

Part 5: S4 and S4D, Making SSMs Practical

S4 (Gu, Goel, and Ré, 2022) was the first architecture to make deep SSMs work at scale by finding an efficient algorithm to compute the convolution kernel from a HiPPO-initialized $A$ matrix. It was the first model to solve long-range tasks at sequence lengths of 16,000+, a result no Transformer or RNN had achieved. S4 also fully exploited the recurrent-convolutional duality: convolution mode for training, recurrence mode for inference.

A key simplification followed quickly. S4D (2022) showed that restricting $A$ to a fully diagonal matrix matched S4’s performance while dramatically simplifying the implementation. Independent state dimensions with well-chosen eigenvalues were sufficient. This diagonal restriction became the standard for all subsequent work, including Mamba.

Part 6: Mamba-1, Selectivity Changes Everything

The LTI Problem

S4 and its variants excelled on continuous signals and synthetic long-range benchmarks. On language modeling, they consistently lagged behind Transformers of the same size.

The reason is exactly the LTI limitation I described earlier. In an LTI system, the matrices $A$, $B$, $C$ are fixed constants. Every token receives identical treatment. The Mamba paper demonstrated this failure precisely with two diagnostic tasks:

Selective Copying: Given “A B _ _ _ C _ _ A _ _”, copy only A, B, C while ignoring the padding underscores. An LTI system cannot distinguish content tokens from padding because it applies the same transformation to everything.

Induction Heads: Given “A B … A ?”, recall that B followed A earlier and predict B. This requires content-based lookup: comparing the current token (A) against stored tokens to find what came after it. An LTI system has no mechanism for content comparison.

Language is full of these patterns. The word “not” should be remembered differently from the word “the.” A name mentioned once in a document needs to be retrievable later. All of this requires the model to make content-dependent decisions about what to store and what to forget.

The Fix: Input-Dependent Parameters

The December 2023 paper “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” by Albert Gu and Tri Dao introduced a single, elegant idea: make $B$, $C$, and $\Delta$ functions of the input token.

$$B_t = \text{Linear}(x_t) \in \mathbb{R}^N$$

$$C_t = \text{Linear}(x_t) \in \mathbb{R}^N$$

$$\Delta_t = \text{softplus}(\text{Linear}(x_t)) \in \mathbb{R}^+$$

Now the model dynamically modulates its behavior on a per-token basis. When the network encounters an important token, it can predict a large $\Delta_t$ to reset the state and absorb the new information. When it encounters filler, it can predict a tiny $\Delta_t$ to preserve existing memory and let the filler leak away.

The roles of each parameter are clear. $\Delta$ controls the gate: large $\Delta$ resets the state and focuses on the current input; small $\Delta$ persists the state and ignores the current input. $B$ controls what enters the state (content-based filtering of what to remember). $C$ controls what exits (content-based modulation of what to read out).

Note that the state matrix $A$ itself remains fixed. This is intentional. $A$ affects the discrete recurrence only through its interaction with $\Delta$ via $\bar{A} = \exp(\Delta A)$, so making $\Delta$ input-dependent is sufficient to make the entire system input-dependent.
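A minimal sketch of the selective recurrence follows. The shapes and the softplus on $\Delta_t$ match the equations above; the random projection weights, the dimensions, and every numeric choice are hypothetical stand-ins for trained parameters, not Mamba's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, L = 4, 8, 6                    # channels, state size, sequence length

W_B = rng.normal(size=(D, N))        # hypothetical projection weights
W_C = rng.normal(size=(D, N))
W_dt = rng.normal(size=(D, D))
A = -np.exp(rng.normal(size=N))      # fixed diagonal A, negative eigenvalues

x = rng.normal(size=(L, D))
h = np.zeros((D, N))                 # one N-dim state per channel
ys = []
for t in range(L):
    B_t = x[t] @ W_B                           # input-dependent B_t
    C_t = x[t] @ W_C                           # input-dependent C_t
    dt = np.log1p(np.exp(x[t] @ W_dt))         # Delta_t = softplus(...), per channel
    A_bar = np.exp(dt[:, None] * A)            # discretize per token: exp(Delta_t * A)
    h = A_bar * h + (dt[:, None] * B_t) * x[t][:, None]   # selective update
    ys.append(h @ C_t)                         # read out through C_t
y = np.stack(ys)                     # (L, D) output sequence
```

Every token gets its own $\bar{A}$, $B_t$, and $C_t$, so the transformation is no longer time-invariant.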

The Cost: Convolution Mode Breaks

Input-dependent parameters mean the system is no longer LTI. $B_t$ and $C_t$ change at every timestep, so there is no single convolution kernel $K$ that describes the entire sequence. The FFT-based parallel training mode is gone.

Naively, this forces sequential computation: process token 1 to get $h_1$, then token 2 to get $h_2$, and so on. This would be catastrophically slow on GPUs, which need parallel workloads to achieve decent utilization.

Mamba-1 solved this with a hardware-aware selective scan algorithm, directly inspired by FlashAttention’s approach to the memory hierarchy. The key idea: fuse all SSM operations (discretization, recurrence, output) into a single GPU kernel that runs entirely in SRAM, avoiding expensive HBM round-trips. The recurrence is parallelized using a parallel scan that exploits the associativity of the multiply-add operation, and intermediate states are recomputed in the backward pass rather than stored. The result: 40x faster than a naive implementation, with the same memory footprint as an optimized Transformer with FlashAttention.

The Mamba Block

A common misconception is that Mamba replaces only the attention layer. It replaces both attention AND the MLP. A standard Transformer decoder block has two sub-layers: multi-head self-attention (which mixes information across sequence positions) and a feed-forward network (which mixes information across feature channels). The Mamba block handles both in a single, unified structure.

[Figure: Transformer decoder block vs Mamba block, same job (sequence mixing + channel mixing), different structure. Transformer: LayerNorm → multi-head self-attention (sequence mixing) → residual add, then LayerNorm → feed-forward MLP (channel mixing) → residual add. Mamba: LayerNorm → linear up-projection, split into an SSM branch (Conv1d → SiLU → selective SSM: sequence mixing) and a gate branch (SiLU: channel mixing), merged, projected back down, residual add. Mamba replaces both attention and MLP in a single block.]

Here is how the Mamba block works. The input ($B \times L \times D$) passes through a LayerNorm and is linearly projected to expand the feature dimension by a factor of $E = 2$. This expanded representation is then split into two parallel branches:

The SSM branch (left): Processes through a short 1D causal convolution (width 4) to capture immediate local patterns between neighboring tokens. Then a SiLU activation. Then three parallel linear projections produce the token-specific $\Delta_t$, $B_t$, and $C_t$. The selective SSM recurrence runs using these dynamic parameters. This branch handles sequence mixing: how information flows across token positions.

The gate branch (right): Takes the other half of the expanded input and passes it through a SiLU activation. This branch serves as a dynamic gate that controls which channels of the SSM output are passed through and which are suppressed.

The two branches merge via element-wise multiplication. If elements in the gating vector are near zero, the corresponding SSM information is suppressed. The result passes through a linear projection back to dimension $D$ and is added to the input via a residual connection.

The entire block is one homogeneous module. No separate attention layer. No separate MLP.

Inference: The Fundamental Trade-off

| | Transformer | Mamba-1 |
| --- | --- | --- |
| State per sequence | KV cache: grows with each token | Hidden state: fixed-size vector |
| Memory complexity | $O(L)$ per sequence | $O(1)$ per sequence |
| Compute per new token | $O(L)$: attend to all previous tokens | $O(1)$: one state update |
| At 128K context | ~200x larger than Mamba state | ~2.6 MiB per sequence |
| Memory type | Lossless: any past token retrievable | Lossy: compressed summary |

[Interactive figure: inference memory as tokens are processed. The Transformer's KV cache grows with every token ($O(L)$ memory per sequence); Mamba's hidden state $\mathbf{h} \in \mathbb{R}^N$ stays fixed-size ($O(1)$ memory per sequence).]
The trade-off is fundamental. The Transformer’s KV cache stores every token (lossless, but $O(L)$ memory). Mamba’s hidden state compresses all history into a fixed vector (lossy, but $O(1)$ memory). The question is always whether the compressed representation is good enough for the task.

Mamba-1 demonstrated that it was. Mamba-2.8B matched or exceeded Pythia-6.9B (a model more than twice its size) on zero-shot downstream evaluations. On the Pile dataset, Mamba-1.4B achieved 59.7% average across common-sense reasoning benchmarks, matching Pythia-2.8B (59.1%). At batch size 16, Mamba-2.8B completed generation in 18.6 seconds versus GPT-Neo-2.7B’s 65.9 seconds (3.5x faster). GPT-Neo ran out of memory at batch size 32 on a 64GB GPU; Mamba continued scaling to batch 128+.

Part 7: Mamba-2, Maximizing GPU Utilization

Mamba-1 had an embarrassing practical problem: it was 2-3x slower than equivalently sized Transformers during training. The root cause is that modern GPUs deliver roughly 16x more throughput for matrix multiplication (via Tensor Cores) than for general arithmetic. Transformers are pure matmul. Mamba-1’s selective scan was not.

Tri Dao and Albert Gu’s May 2024 paper “Transformers are SSMs” solved this by proving that unrolling the SSM recurrence produces a structured matrix that can be computed via matrix multiplications. The resulting algorithm (SSD) splits the sequence into chunks: within each chunk, the computation runs as dense matmuls on Tensor Cores; between chunks, a short scan passes state forward. Training speed improved 2-8x over Mamba-1.

The trade-off: to fit this matrix framework, $A$ is restricted from a diagonal matrix (Mamba-1) to a scalar-times-identity (Mamba-2), meaning all state dimensions within a head share one decay rate. Mamba-2 compensates with a multi-head structure and increases the state dimension from $N = 16$ to $N = 64\text{-}256$.

Part 8: Mamba-3, Three Innovations from Classical SSM Theory

Published at ICLR 2026, Mamba-3 asks a different question than its predecessors. Mamba-2 optimized for training speed by simplifying the SSM to leverage Tensor Cores. But with the rise of RL post-training, agentic workflows, and test-time compute scaling, inference efficiency has become the primary bottleneck.

Here is the problem Mamba-3 targets. During autoregressive decoding, Mamba-2’s simplified recurrence is memory-bound. The GPU loads the state from HBM to SRAM, performs a trivially small computation (the scalar-times-identity update is cheap), and writes the result back. The arithmetic intensity is roughly 2.5 ops/byte. The H100 needs $\sim$295 ops/byte to be compute-bound. More than 99% of GPU compute sits idle during token generation.

Mamba-3’s overarching philosophy is to increase arithmetic intensity during decoding by making the state update mathematically richer, spending more compute per byte of memory traffic, filling idle GPU cycles rather than adding new ones. Three innovations accomplish this.

Innovation 1: Exponential-Trapezoidal Discretization

The problem. Mamba-1 and Mamba-2 used what the Mamba-3 authors retroactively classify as “Exponential-Euler” discretization: the exact formula $\bar{A} = \exp(\Delta A)$ paired with the first-order Euler approximation $\bar{B} = \Delta B$. This is a hybrid: exact for the state decay, but approximate for how the input enters the state. The local truncation error is $O(\Delta^2)$.

In numerical analysis terms, the Euler method approximates the area under a curve using a rectangle aligned to one endpoint. It captures the value at the start of the interval but ignores how the curve changes across the interval. This crude approximation struggles with fast-moving temporal dependencies, producing “jerky” transitions in the state.

In practice, prior Mamba models compensated by adding an explicit short 1D causal convolution (Conv1d, width 4) before the SSM. This Conv1d smoothed out immediate local token interactions that the imprecise discretization missed. It worked, but it was an architectural bandage for a mathematical shortcoming. And it added latency at inference: one more sequential operation per token.

The intuition. The trapezoidal rule approximates the area under a curve using a trapezoid instead of a rectangle. A rectangle uses only one endpoint’s value. A trapezoid uses both endpoints and draws a straight line between them, capturing the slope of the curve across the interval. This gives second-order accuracy: the local error drops from $O(\Delta^2)$ to $O(\Delta^3)$.

[Figure: Euler vs trapezoidal approximation of the area under $f(t)$ over one interval $[t, t+\Delta]$. The Euler method uses only the left endpoint $f(t)$, a rectangle with $O(\Delta^2)$ local error. The trapezoidal rule uses both endpoints $f(t)$ and $f(t+\Delta)$, shrinking the local error to $O(\Delta^3)$.]

The math. Applying the generalized trapezoidal rule to the SSM’s state equation produces a three-term recurrence, where the old Euler formula had only two:

$$h_t = \underbrace{\exp(\Delta_t A_t) \cdot h_{t-1}}_{\text{Term 1: decayed previous state}} + \underbrace{(1 - \lambda_t) \cdot \Delta_t \cdot \exp(\Delta_t A_t) \cdot B_{t-1} \cdot x_{t-1}}_{\text{Term 2 (NEW): previous input, decayed}} + \underbrace{\lambda_t \cdot \Delta_t \cdot B_t \cdot x_t}_{\text{Term 3: current input}}$$

Term 1 is identical to the old formula: decay the previous state. Term 3 is similar to the old Euler input term, but weighted by $\lambda_t$ instead of 1. The new addition is Term 2: the previous timestep’s input $x_{t-1}$, projected through $B_{t-1}$, scaled by $(1 - \lambda_t)$, and decayed by the same exponential factor as the state.

The parameter $\lambda_t$ is a data-dependent convex combination weight. The $(1-\lambda_t)$ and $\lambda_t$ coefficients on consecutive inputs are the weights of the trapezoid: $(1-\lambda)$ on the left endpoint (previous input), $\lambda$ on the right endpoint (current input).

Let me verify the connection to the old formula. When $\lambda_t = 1$: Term 2 vanishes entirely because its coefficient $(1-\lambda_t) = 0$. Term 3 becomes $1 \cdot \Delta_t \cdot B_t \cdot x_t = \Delta_t B_t x_t$, which is exactly the Euler formula $\bar{B}x_t$. So the old Mamba-1/2 discretization is the special case $\lambda = 1$ of this more general formula.

A numerical walkthrough. Take $a = -0.5$, $\Delta = 0.1$, $b = 1$, and process the input sequence $[1, 1, 0]$ with $\lambda = 0.5$ (balanced trapezoidal blending):

# Parameters: a = -0.5, delta = 0.1, b = 1.0, lambda = 0.5
# exp(delta * a) = exp(-0.05) = 0.9512

# k=0: input=1 (no previous input, Term 2 uses x_{-1}=0)
#   Term 1: 0.9512 * 0 = 0                              (no state yet)
#   Term 2: (1-0.5) * 0.1 * 0.9512 * 1.0 * 0 = 0        (no prev input)
#   Term 3: 0.5 * 0.1 * 1.0 * 1 = 0.05
#   h_0 = 0 + 0 + 0.05 = 0.0500

# k=1: input=1, prev_input=1
#   Term 1: 0.9512 * 0.05 = 0.04756                     (decayed state)
#   Term 2: 0.5 * 0.1 * 0.9512 * 1.0 * 1 = 0.04756     (prev input, decayed)
#   Term 3: 0.5 * 0.1 * 1.0 * 1 = 0.05                  (current input)
#   h_1 = 0.04756 + 0.04756 + 0.05 = 0.14512

# k=2: input=0, prev_input=1
#   Term 1: 0.9512 * 0.14512 = 0.13804                  (decayed state)
#   Term 2: 0.5 * 0.1 * 0.9512 * 1.0 * 1 = 0.04756     (prev input=1, decayed)
#   Term 3: 0.5 * 0.1 * 1.0 * 0 = 0.0                   (current input=0)
#   h_2 = 0.13804 + 0.04756 + 0.0 = 0.18560

# Compare to a one-endpoint (rectangle) baseline without Term 2, using
# the exact input coefficient b_bar = (exp(delta*a) - 1)/a = 0.09754:
# k=1 baseline: h = 0.9512 * 0.09754 + 0.09754 = 0.19032
# k=2 baseline: h = 0.9512 * 0.19032 + 0.09754 * 0 = 0.18103

At step 2, the trapezoidal version ($h = 0.1856$) is higher than the one-endpoint baseline ($h = 0.1810$). The difference comes from Term 2: even though the current input is 0, the trapezoidal rule still accounts for the previous input ($x_1 = 1$) via the left endpoint of the trapezoid. A rectangle rule ignores this entirely. For fast-changing input sequences, this difference matters.
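The walkthrough can be checked mechanically. A sketch of the scalar trapezoidal recurrence with the same toy values ($a = -0.5$, $\Delta = 0.1$, $b = 1$, $\lambda = 0.5$):

```python
import math

# Scalar trapezoidal recurrence, matching the three-term formula above
a, delta, b, lam = -0.5, 0.1, 1.0, 0.5
decay = math.exp(delta * a)                          # exp(-0.05) ≈ 0.9512

h, x_prev = 0.0, 0.0
for x in [1.0, 1.0, 0.0]:
    term1 = decay * h                                # decayed previous state
    term2 = (1 - lam) * delta * decay * b * x_prev   # NEW: previous input, decayed
    term3 = lam * delta * b * x                      # current input
    h = term1 + term2 + term3
    x_prev = x

print(round(h, 4))   # 0.1856, matching the walkthrough
```

Setting `lam = 1.0` zeroes out `term2` and recovers the old Euler-style input term $\Delta_t B_t x_t$.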

[Figure: the trapezoidal recurrence at step $k=2$, three terms instead of two. Term 1: the decayed previous state, $0.9512 \times h_1 = 0.13804$. Term 2 (new in Mamba-3): the previous input, decayed, $x_1 = 1$ scaled to $0.04756$. Term 3: the current input, $x_2 = 0$, contributing $0$. Sum: $h_2 = 0.18560$.]

The implicit convolution and the death of Conv1d. Here is the subtle and important consequence. Because the state update at time $t$ depends on both $x_t$ and $x_{t-1}$, the trapezoidal recurrence contains an implicit convolution of width 2. The $(1-\lambda_t)$ and $\lambda_t$ weights on the consecutive inputs play the role of a learned, data-dependent convolution filter operating on pairs of adjacent tokens.

For years, SSM architectures (H3, RWKV-4, Mamba-1, Mamba-2) required an explicit external Conv1d (width 4) before the SSM to handle immediate local token interactions. The Conv1d was considered essential for capturing “induction head” copying behaviors and local patterns. Mamba-3 found that the implicit width-2 convolution from the trapezoidal discretization, combined with learnable bias terms on $B$ and $C$ (constant vectors added after normalization), is expressive enough to replace the external Conv1d entirely. Mamba-3 is the first Mamba variant to drop the Conv1d without performance loss.

[Figure: How Mamba-3 eliminates the Conv1d, by absorbing the causal convolution into the discretization. Mamba-1/2: Input → Conv1d (width 4) → SiLU → Selective SSM → Output. Mamba-3: Input → Selective SSM with trapezoidal discretization (implicit convolution built in) → Output. Fewer ops per token means lower decode latency on your H100.]

Why it matters for your H100. Fewer sequential operations per token at inference. No Conv1d kernel launch, no Conv1d memory traffic, no Conv1d compute. The architecture is simpler and the discretization is now theoretically justified ($O(\Delta^3)$ error) instead of a heuristic patched by an external convolution.

Innovation 2: Complex-Valued SSMs via RoPE

The problem. Real-valued SSMs with non-negative eigenvalues can only decay monotonically. The state gets smaller over time, or stays the same, but it cannot oscillate. Mathematically: if $\bar{a} \in [0, 1]$, then $\bar{a}^k$ is a monotonically non-increasing sequence. The state can only move in one direction (toward zero).

This means real-valued SSMs cannot solve simple state-tracking tasks that require flipping between states. Consider parity: given a stream of bits, track whether the running count of 1s is even or odd. Every time a 1 arrives, the parity flips. This requires the state to toggle between two values indefinitely. A monotonically decaying state cannot do this. On the bit sequence parity task, Mamba-2 scored no better than random guessing.

The intuition. Real eigenvalues restrict the state to movement along a line: it can only grow or shrink. Complex eigenvalues enable rotation: the state can cycle through values, oscillate, and flip. For even/odd tracking, you need the state to flip sign every time a 1 appears. A 180-degree rotation achieves exactly this. Real, non-negative arithmetic cannot.
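A toy sketch makes the rotation argument concrete (this illustrates the principle, not the Mamba-3 kernel):

```python
import cmath

# A unit-modulus complex state, rotated 180 degrees per incoming 1-bit,
# tracks parity exactly; a real state with eigenvalue in [0, 1] can only
# shrink monotonically and can never flip sign.
bits = [1, 0, 1, 1, 0, 1]

state = 1 + 0j
for b in bits:
    if b:
        state *= cmath.exp(1j * cmath.pi)    # 180-degree rotation per 1-bit
parity = 0 if state.real > 0 else 1
print(parity)                                # four 1s -> even count -> 0

# The real-valued counterpart decays regardless of input; no threshold on it
# can recover the running parity.
real_state = 1.0
for b in bits:
    real_state *= 0.95
```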

The clever trick. Implementing complex arithmetic on GPUs is painful. Complex numbers double memory requirements, break existing CUDA kernel optimizations, and introduce alignment issues. Mamba-3 avoids all of this through a mathematical equivalence.

The key theoretical result (Proposition 3 in the paper): a discretized complex-valued diagonal SSM is mathematically equivalent to a real-valued SSM with data-dependent Rotary Positional Embeddings (RoPE) applied to $B$ and $C$. The decomposition works as follows:

  • The real part of the complex eigenvalue controls decay. This is handled by the existing SSD machinery, exactly as in Mamba-2. No changes needed.
  • The imaginary part controls rotation. This is factored out and implemented as rotary embeddings applied to the $B$ and $C$ projection vectors.

The rotation angles are produced dynamically via projections from the current input token $x_t$, rather than using static positional indices as in standard Transformer RoPE. This is why it is called “data-dependent” RoPE. The rotation applied to $B$ and $C$ changes based on what token is being processed.

No complex number ever appears in the GPU kernels. The real-valued SSD computation runs at the same speed as before. The rotational dynamics are absorbed into $B$ and $C$ via the same RoPE infrastructure that Transformers already use for positional encoding. Existing Transformer tooling (rotary embedding kernels, fused attention implementations) can be reused directly.
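A minimal sketch of the mechanism, with illustrative shapes and a hypothetical angle projection `W_angle` (the paper's exact parameterization may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 8, 4                   # toy sizes, chosen for illustration

def rope_rotate(v, angles):
    """Rotate consecutive pairs of v by the given angles (standard RoPE)."""
    out = v.copy()
    cos, sin = np.cos(angles), np.sin(angles)
    out[0::2] = v[0::2] * cos - v[1::2] * sin
    out[1::2] = v[0::2] * sin + v[1::2] * cos
    return out

W_angle = rng.normal(size=(d_model, d_state // 2))  # assumed angle projection
x_t = rng.normal(size=d_model)                       # current token
B_t = rng.normal(size=d_state)                       # input projection vector

angles = x_t @ W_angle        # data-dependent: a function of x_t, not position
B_rot = rope_rotate(B_t, angles)

# Rotation preserves norms: only the direction (phase) of B changes.
assert np.allclose(np.linalg.norm(B_rot), np.linalg.norm(B_t))
```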

The result. On synthetic state-tracking benchmarks:

| Task | Mamba-2 (Real) | Mamba-3 w/o RoPE | Mamba-3 w/ Std. RoPE | Mamba-3 (Complex / Data-Dep RoPE) |
|---|---|---|---|---|
| Bit Sequence Parity | Random Guess | 2.27% | 1.56% | 100.00% |
| Modular Arith. (No Brackets) | 0.90% | Random Guess | 20.70% | 98.51% |
| Modular Arith. (Brackets) | Fail | Fail | 2.62% | 87.75% |

Mamba-3 solves parity perfectly and near-perfectly executes complex modular arithmetic. Standard (non-data-dependent) RoPE does not help. Static positional rotation angles cannot implement content-dependent state flipping. The data-dependent version can, because the rotation is a function of the input.

These are tasks that were mathematically impossible for real-valued SSMs, regardless of scale or training budget. The real-complex boundary is a hard expressivity ceiling, not a soft scaling issue.

Innovation 3: MIMO (Multi-Input Multi-Output)

The problem. As discussed above, standard SSMs waste more than 99% of GPU compute during decoding because each state update is a trivial rank-1 operation: the GPU loads the entire state from memory, performs a single multiply-add, and writes it back. The computation is too cheap relative to the memory transfer.

The fix. Instead of processing one input and producing one output per SSM (SISO), process $R$ inputs and $R$ outputs simultaneously (MIMO). The scalar input $x_t$ is linearly projected into a matrix $X_t$ with rank $R$. The projection vectors $B_t$ and $C_t$ are correspondingly expanded to rank-$R$ structures. The state update becomes a matrix multiplication instead of an outer product:

$$H_t = \bar{A} \cdot H_{t-1} + B_t \cdot X_t^T$$

With $R = 4$, the model performs 4x the floating-point operations for the same amount of memory traffic. The arithmetic intensity jumps from $\sim$2.5 to $\sim$10 ops/byte. Still not enough to fully saturate the H100, but a 4x improvement in GPU utilization during the memory-bound decode phase.
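A shape-level sketch of one MIMO decode step, with illustrative dimensions (the released kernels' exact layout may differ). The key point is that the rank-$R$ matmul does $R$ times the arithmetic over the same state traffic:

```python
import numpy as np

N, P, R = 64, 64, 4                       # state rows, head dim, MIMO rank
rng = np.random.default_rng(0)

A_bar = np.exp(-0.05)                     # scalar decay (value assumed)
H = rng.normal(size=(N, P))               # state, same size as SISO
B_t = rng.normal(size=(N, R))             # rank-R input projection
X_t = rng.normal(size=(P, R))             # R inputs for this step

H = A_bar * H + B_t @ X_t.T               # H_t = A_bar * H_{t-1} + B_t * X_t^T

# Back-of-envelope arithmetic intensity, counting only state traffic in bf16:
def ops_per_state_byte(rank):
    flops = 2 * N * P * rank              # multiply-adds in the matmul
    state_bytes = 2 * N * P * 2           # read + write the state, 2 bytes/elt
    return flops / state_bytes

print(ops_per_state_byte(4) / ops_per_state_byte(1))   # -> 4.0
```

This simplified count ignores the traffic for $B_t$, $X_t$, and outputs, so the absolute numbers differ from the paper's $\sim$2.5 and $\sim$10 ops/byte, but the 4x ratio is the same.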

Crucially, only the SSM-specific parameters ($B_t$, $C_t$, and the state $H_t$) grow with $R$. The main input projections, the output projections, and the residual gate all remain at their original sizes. This contains the parameter increase to the SSM core.

Why it does not hurt latency. The extra compute fills idle GPU cycles. During decoding, the bottleneck is the time it takes to load the state from HBM to SRAM. While that data transfer is in flight, the Tensor Cores have nothing to do. MIMO gives them work. The wall-clock time per decode step is dominated by memory transfer time, not compute time, so adding compute within the transfer window is effectively free. MIMO with $R = 4$ matches Mamba-2’s decode speed while delivering substantially better accuracy.

The result. At 1.5B scale with Chinchilla-optimal training: the base Mamba-3 (SISO) outpaces Gated DeltaNet (the previous state-of-the-art sub-quadratic model) by 0.6 percentage points on average downstream accuracy. Adding MIMO with $R = 4$ adds another 1.2 points, for a total gain of 1.8 points over Gated DeltaNet, 1.9 over Mamba-2, and 2.2 over equivalently-sized pure Transformers. The 1.5B MIMO variant achieves 57.6% average accuracy across benchmarks.

A Mamba-3 MIMO model with state dimension $N = 64$ matches the perplexity and downstream accuracy of a Mamba-2 model with $N = 128$. Halving the state size while maintaining quality doubles inference throughput within the same hardware footprint.

Architectural Changes

Two more structural changes round out Mamba-3:

Normalization. Mamba-3 replaces post-gate RMSNorm (Mamba-2) with QKNorm, also called BCNorm: RMS normalization applied directly to the $B$ and $C$ projections before mixing. This stabilizes variance and activation spikes during large-scale pretraining, which is especially important with the added mathematical complexity of trapezoidal recurrence and MIMO updates.
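A minimal sketch of the normalization itself (the epsilon value and the omission of a learned scale are simplifying assumptions):

```python
import numpy as np

def rms_norm(v, eps=1e-6):
    """RMS-normalize a projection vector, as applied to B and C."""
    return v / np.sqrt(np.mean(v * v) + eps)

B_t = np.array([3.0, -4.0, 0.0, 0.0])     # illustrative projection output
B_normed = rms_norm(B_t)

# After normalization the root-mean-square is ~1 regardless of input scale,
# which suppresses the activation spikes that destabilize large-scale runs.
assert abs(np.sqrt(np.mean(B_normed**2)) - 1.0) < 1e-3
```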

Block structure. Mamba-1 and Mamba-2 fused the sequence mixer (SSM) and channel mixer (MLP) into a single homogeneous block. Mamba-3 reverses this decision. It adopts an interleaved architecture that matches the Llama family: alternating Mamba-3 SSM blocks with standard SwiGLU MLP blocks. Each Mamba-3 block handles sequence mixing; each SwiGLU MLP handles channel mixing. This Llama-compatible topology makes it straightforward to create hybrid models by swapping some SSM blocks for attention blocks.

Evolution Comparison

| Feature | Mamba-1 | Mamba-2 | Mamba-3 |
|---|---|---|---|
| Venue | COLM 2024 | ICML 2024 | ICLR 2026 |
| $A$ matrix | Diagonal (real) | Scalar $\times$ identity | Complex-valued (data-dep RoPE) |
| State size | $N = 16$ | $N = 64\text{-}256$ | Matches Mamba-2 at half $N$ |
| Short conv | Required (width 4) | Required (width 4) | Removed |
| MIMO | No | No | Yes (rank-$R$) |
| Discretization | Exp-Euler, $O(\Delta^2)$ | Exp-Euler, $O(\Delta^2)$ | Exp-Trapezoidal, $O(\Delta^3)$ |
| Design priority | Quality + selectivity | Training speed (Tensor Cores) | Inference efficiency |
| State tracking | Cannot solve parity | Cannot solve parity | Solves parity + modular arith. |
| Block structure | Fused SSM+MLP | Fused SSM+MLP | Interleaved SSM + MLP |

Part 9: The Bigger Picture

Hybrid Architectures Are the Production Standard

The field has converged on an empirical finding: hybrid architectures combining SSM layers with a small fraction of attention layers outperform both pure approaches. Albert Gu has articulated the fundamental reason clearly. Transformers are like databases: they cache every token for future reference (perfect recall, but linear memory growth). SSMs are like brains: they compress all history into a fixed-size state (infinite context, but lossy).

Pure SSMs struggle with two specific capabilities. Exact retrieval: finding a specific fact buried in a long context degrades as context grows, because the fixed-size state cannot perfectly memorize arbitrary content. In-context learning: few-shot pattern matching from prompt examples requires comparing the current token against specific stored tokens, which is fundamentally an attention operation.

The solution adopted by every major lab: use SSM layers for the vast majority of the network and sprinkle in a few attention layers for precise retrieval. The exact ratio varies (5:1 in Mamba-3’s recommended config, 9:1 in Granite), but the pattern is universal.
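A hypothetical sketch of how such a stack could be laid out; the layer names and helper are illustrative, not any library's API:

```python
def hybrid_layout(n_mixers: int, ratio: int = 5) -> list[str]:
    """Build a layer list with one attention block per `ratio` SSM blocks,
    each sequence mixer followed by a SwiGLU MLP (Llama-style interleaving)."""
    layers = []
    for i in range(n_mixers):
        # Every (ratio + 1)-th sequence mixer is attention; the rest are SSM.
        mixer = "attention" if (i + 1) % (ratio + 1) == 0 else "mamba3"
        layers += [mixer, "swiglu_mlp"]
    return layers

layout = hybrid_layout(12, ratio=5)
print(layout.count("mamba3"), layout.count("attention"))   # 10 2
```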

What This Means for Your Inference Stack

For inference infrastructure teams, the implications are concrete. With only 10-15% of layers using attention, you manage KV cache for those few layers, not the entire network. SSM layers need no KV cache management at all: no PagedAttention, no eviction policies, no memory fragmentation. And throughput advantages grow with context length. Pure Transformers are faster at short sequences ($<$2K tokens), but SSM-based models cross over quickly and the gap widens: at 57K tokens, Mamba-2 outperforms Transformers by 4x. SSM decode cost is constant per token; Transformer decode cost grows linearly.
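A back-of-envelope sizing makes the cache saving concrete. The config here is assumed for illustration (80-layer 70B-class model, 8 KV heads of dim 128 under GQA, bf16), not any specific deployment:

```python
def kv_cache_gb(n_attn_layers, seq_len, batch=1,
                n_kv_heads=8, head_dim=128, bytes_per=2):
    """KV cache size in GB: K and V per token, per attention layer."""
    per_token = 2 * n_kv_heads * head_dim * bytes_per
    return batch * seq_len * n_attn_layers * per_token / 1e9

full = kv_cache_gb(n_attn_layers=80, seq_len=128_000)   # pure Transformer
hybrid = kv_cache_gb(n_attn_layers=8, seq_len=128_000)  # ~10% attention layers

print(f"{full:.1f} GB vs {hybrid:.1f} GB")   # ~41.9 GB vs ~4.2 GB
```

SSM layers contribute only their fixed-size states on top of this, which do not grow with sequence length.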

The Fundamental Trade-off Persists

SSMs compress. Attention caches. Mamba-3 makes the compressed memory more expressive through complex dynamics, higher-order discretization, and MIMO. But it cannot eliminate the compression. If your workload requires perfect verbatim retrieval of a specific sentence from a 100K-token document, you need attention layers for that.

The Transformer monopoly has ended. But Transformers are not dead. They are becoming a specialized, strategically-placed component within a larger hybrid architecture, used precisely where their lossless memory is needed and nowhere else.


References

  1. Gu, A., Dao, T., Ermon, S., Rudra, A., & Re, C. (2020). HiPPO: Recurrent Memory with Optimal Polynomial Projections. NeurIPS 2020.

  2. Gu, A., Goel, K., & Re, C. (2022). Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022. (S4)

  3. Gu, A., Gupta, A., Goel, K., & Re, C. (2022). On the Parameterization and Initialization of Diagonal State Space Models. NeurIPS 2022. (S4D)

  4. Gu, A. & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. COLM 2024.

  5. Dao, T. & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. ICML 2024. (Mamba-2 / SSD)

  6. Lahoti, A., Li, A., Chen, Y., Wang, Z., Bick, T., Kolter, J. Z., Dao, T., & Gu, A. (2026). Mamba-3: Improved Sequence Modeling using State Space Principles. ICLR 2026.

  7. Together AI. Mamba-3 Blog Post. Technical overview and benchmark results.

  8. Goomba Lab. Blog series on Structured State Space Duality and Mamba-3 mathematical foundations.

  9. Tri Dao. Mamba-3 Part 2: Methodological Deep Dive. Detailed derivation of the exponential-trapezoidal discretization and RoPE equivalence.

  10. Princeton Language and Intelligence. Mamba-2: Algorithms and Systems. Technical walkthrough of the SSD algorithm.

  11. NVIDIA. Nemotron-3-Super: hybrid Mamba-2 + MoE + Attention architecture for production deployment.

  12. IBM. Granite 4.0: 9:1 Mamba-to-Transformer ratio with >70% memory reduction vs. conventional LLMs.