The Speed Problem Wasn’t Always About Compute

In late 2022 and early 2023, two independent research teams at Google and DeepMind published papers with remarkably similar insights. Both had discovered a way to make large language models generate text 2-3× faster without approximations, without quality loss, and without changing the output distribution at all. The technique was speculative decoding.

Here’s the counterintuitive reality: when you run a 70B parameter model on a modern GPU, most of the computational units sit idle. The expensive tensor cores that can perform trillions of operations per second spend the majority of their time doing nothing, waiting. They’re waiting for data to arrive from memory. This is the memory bandwidth bottleneck, and it’s the reason that making LLMs faster is about doing more useful work with each expensive memory read.

Speculative decoding exploits this idle capacity in an elegant way: use a small, fast model to guess what tokens the big model will produce, then verify those guesses in parallel. When the guesses are right, and they often are, you’ve generated multiple tokens for the price of one memory read of the large model’s weights.

GLM-4.7, Zhipu AI’s 355B parameter flagship released in December 2025, takes this further by building Multi-Token Prediction (MTP) directly into its architecture. With vLLM’s optimized implementation, the model achieves acceptance rates exceeding 90% and generation speeds beyond 100 tokens per second, a glimpse of where inference optimization is heading.

Why LLM Inference Is Memory-Bound

To understand why speculative decoding works, we need to understand why LLM inference is slow in the first place.

Consider what happens when a 70B parameter model generates a single token. The GPU must:

  1. Load the model’s ~140GB of weights from High Bandwidth Memory (HBM)
  2. Perform matrix multiplications with the current token’s hidden states
  3. Produce a probability distribution over the vocabulary
  4. Sample one token
  5. Repeat for the next token

The critical insight is in step 1. An NVIDIA H100 GPU can perform roughly 2,000 trillion floating-point operations per second (TFLOPS). But its memory bandwidth—the rate at which it can read data from HBM—is “only” 3.35 TB/s.

Let’s do the arithmetic. Loading 140GB of weights at 3.35 TB/s takes about 42 milliseconds. The actual matrix multiplications for a single token? Perhaps 1-2 milliseconds of computation. The GPU spends roughly 95% of its time waiting for memory transfers and only 5% doing actual math.

This ratio is captured by a metric called arithmetic intensity: the number of floating-point operations performed per byte of memory transferred. For autoregressive LLM inference at batch size 1, arithmetic intensity is approximately 1-2 FLOP/byte. Modern GPUs are designed for workloads with arithmetic intensity of 100+ FLOP/byte. The mismatch is severe.
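
A quick back-of-envelope sketch in Python, using the nominal figures quoted above (peak spec-sheet numbers, not measurements), makes the imbalance concrete:

# Back-of-envelope estimate of the memory bandwidth bottleneck (nominal specs, not measurements)
weights_bytes = 140e9        # ~140 GB of FP16 weights for a 70B model
hbm_bandwidth = 3.35e12      # H100 HBM3 bandwidth in bytes/sec
flops_per_token = 2 * 70e9   # roughly 2 FLOPs per parameter per generated token

memory_time_ms = weights_bytes / hbm_bandwidth * 1e3
intensity = flops_per_token / weights_bytes      # FLOPs per byte of weights read

print(f"weight-read time per token: {memory_time_ms:.0f} ms")   # ~42 ms
print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")        # ~1 FLOP/byte, vs 100+ needed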

If we could somehow verify multiple tokens in a single forward pass, we’d amortize that expensive 42ms memory read across several tokens instead of just one. This is the aim of speculative decoding.

The Draft-Verify Paradigm

The speculative decoding algorithm operates in a simple loop:

Draft Phase: A small, fast “draft” model autoregressively generates γ candidate tokens. Because this model is 50-100× smaller than the target, its memory reads are proportionally faster.

Verify Phase: The large “target” model processes all γ candidates in a single forward pass. Thanks to the parallelism of transformer attention, scoring γ tokens takes nearly the same time as scoring 1 token—the memory bandwidth cost is identical.

Accept/Reject Phase: Compare the draft model’s predictions against the target model’s true probabilities. Accept tokens that match well; reject and resample where they diverge.

Here’s the algorithm in pseudocode:

def speculative_decode(prefix, draft_model, target_model, γ):
    # Step 1: Draft γ tokens autoregressively (cheap: small model, small memory reads)
    drafts, q_dists = [], []
    for _ in range(γ):
        q = draft_model(prefix + drafts)          # draft distribution over the vocabulary
        x = sample(q)
        drafts.append(x)
        q_dists.append(q)

    # Step 2: Score all γ+1 positions in parallel (expensive, but a single forward pass)
    # p_dists[i] is the target distribution conditioned on prefix + drafts[:i]
    p_dists = target_model(prefix, drafts)

    # Step 3: Accept/reject with rejection sampling
    for i in range(γ):
        p_x, q_x = p_dists[i][drafts[i]], q_dists[i][drafts[i]]
        if random() < min(1, p_x / q_x):
            continue                              # accept drafts[i], check the next position
        # Reject: discard drafts[i:], resample from the adjusted distribution (elementwise max)
        residual = normalize(max(0, p_dists[i] - q_dists[i]))
        return prefix + drafts[:i] + [sample(residual)]

    # All γ drafts accepted: bonus token from the final position
    return prefix + drafts + [sample(p_dists[γ])]

The key is in step 3. When we accept a draft token, we move forward. When we reject, we don’t just discard the draft; we sample from an adjusted distribution that “fills in” exactly the probability mass the draft model missed. This ensures the output distribution is mathematically identical to standard autoregressive decoding.

The Math of Distribution Preservation

This is the part that makes speculative decoding remarkable. The output distribution is exactly the same as if you had run standard autoregressive decoding with the target model alone. Understanding why requires examining the rejection sampling mechanism.

Let $p(x)$ denote the target model’s probability distribution and $q(x)$ denote the draft model’s distribution. For a draft token $x'$, we accept it with probability:

$$\alpha(x') = \min\left(1, \frac{p(x')}{q(x')}\right)$$

When rejected, we resample from the adjusted distribution:

$$p'(x) = \text{normalize}\left(\max(0, p(x) - q(x))\right)$$

The key theorem is that this process produces samples from $p(x)$. Here’s the proof:

$$P(X = x') = P(\text{accepted}, X = x') + P(\text{rejected}, X = x')$$

For the accepted case, we sample $x'$ from $q$ and accept with probability $\min(1, p(x')/q(x'))$:

$$P(\text{accepted}, X = x') = q(x') \cdot \min\left(1, \frac{p(x')}{q(x')}\right) = \min(q(x'), p(x'))$$

For the rejected case, we first reject (which happens with total probability $1 - \sum_x \min(p(x), q(x))$), then resample from $p'$:

$$P(\text{rejected}, X = x') = \left(1 - \sum_x \min(p(x), q(x))\right) \cdot \frac{\max(0, p(x') - q(x'))}{\sum_x \max(0, p(x) - q(x))}$$

The denominator equals $1 - \sum_x \min(p(x), q(x))$ (since $\sum_x \max(0, p(x) - q(x)) = 1 - \sum_x \min(p(x), q(x))$), so the rejection probability cancels:

$$P(\text{rejected}, X = x') = \max(0, p(x') - q(x')) = p(x') - \min(p(x'), q(x'))$$

Adding both cases:

$$P(X = x') = \min(p(x'), q(x')) + p(x') - \min(p(x'), q(x')) = p(x')$$

This proof holds regardless of how good the draft model is. A poorly aligned draft simply increases rejection rate without corrupting the output distribution. The guarantee is unconditional.
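
Because the guarantee holds for any draft distribution, it is easy to sanity-check numerically. The sketch below (a toy five-token vocabulary with a deliberately misaligned draft, not tied to any real model) applies the accept/reject rule and confirms the empirical output frequencies match $p$:

import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary of 5 tokens with arbitrary target and draft distributions
p = np.array([0.40, 0.25, 0.15, 0.15, 0.05])   # target model
q = np.array([0.10, 0.40, 0.20, 0.10, 0.20])   # misaligned draft model

residual = np.maximum(0.0, p - q)
residual /= residual.sum()                      # adjusted distribution used on rejection

def one_token():
    x = rng.choice(len(q), p=q)                 # sample a draft token
    if rng.random() < min(1.0, p[x] / q[x]):    # accept with prob min(1, p/q)
        return x
    return rng.choice(len(p), p=residual)       # reject: resample from the residual

samples = np.array([one_token() for _ in range(200_000)])
empirical = np.bincount(samples, minlength=len(p)) / len(samples)
print(np.round(empirical, 3))                   # ≈ [0.4, 0.25, 0.15, 0.15, 0.05], i.e. p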

Building Intuition for Rejection Sampling

Let’s build some intuition for why rejection sampling works.

Imagine two probability distributions over possible next tokens. The target distribution $p(x)$ represents what the large model actually wants to output. The draft distribution $q(x)$ represents the small model’s best guess.

Picture these as two overlapping curves. Where they overlap—where the draft model agrees with the target—we can safely use the draft’s samples. The acceptance probability $\min(1, p/q)$ ensures we never accept a token more often than the target model would generate it.

[Interactive figure: Distribution Overlap & Acceptance. It visualizes how draft model alignment affects the token acceptance probability α = Σ min(p(x), q(x)). The overlap between the target p(x) and draft q(x) curves is the probability mass that can be safely accepted from the draft model; where q(x) > p(x) the draft “overshoots” and risks rejection, and the residual mass where p(x) > q(x) is what we draw from on rejection, ensuring the output matches the target distribution exactly.]

But what about the probability mass where $p(x) > q(x)$? These are tokens the target model likes more than the draft model expected. If we only accepted, we’d undersample these tokens. The resampling step corrects for this: when we reject, we draw from exactly this “missing” probability mass.

The total acceptance rate, i.e. the probability that any given draft token is accepted, equals the overlap between the two distributions:

$$\alpha = \sum_x \min(p(x), q(x))$$

This quantity has a nice interpretation: it equals 1 minus the total variation distance between $p$ and $q$ (where the total variation distance is $\frac{1}{2}\sum_x |p(x) - q(x)|$). When the distributions are identical, $\alpha = 1$ and we always accept. When they’re completely disjoint, $\alpha = 0$ and we always reject.
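
For instance, with the arbitrary five-token $p$ and $q$ used in the simulation above, one line of NumPy confirms the identity:

import numpy as np

p = np.array([0.40, 0.25, 0.15, 0.15, 0.05])
q = np.array([0.10, 0.40, 0.20, 0.10, 0.20])

alpha = np.minimum(p, q).sum()          # expected acceptance rate
tv = 0.5 * np.abs(p - q).sum()          # total variation distance

print(alpha, 1.0 - tv)                  # both 0.65: alpha = 1 - TV(p, q)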

In practice, well-matched draft-target pairs achieve $\alpha$ between roughly 0.6 and 0.8, while architecturally integrated solutions like GLM-4.7’s native MTP exceed 0.9.

A Concrete Walkthrough of Rejection Sampling

Let’s ground the mathematics in concrete examples to build deeper intuition for how the algorithm actually works.

The Sequential Verification Problem

When the draft model generates K tokens, each token is conditioned on the previous ones:

$$x_1 \sim q(\cdot)$$ $$x_2 \sim q(\cdot|x_1)$$ $$x_3 \sim q(\cdot|x_1,x_2)$$

The target model verifies by computing in parallel:

$$p(x_1), \quad p(x_2|x_1), \quad p(x_3|x_1,x_2), \quad \ldots$$

The subtle issue: if you reject $x_2$, then $x_3$ was generated from the wrong context.

The draft model generated $x_3$ assuming $x_2$ was correct. But if you reject $x_2$ and resample a different token $x_2'$, then $x_3$ is invalid: it was conditioned on a token that no longer exists in the sequence.

Concrete Example:

Draft generates:  "The cat sat on the [mat]"
                                        ↑ rejected, resample → "rug"

Draft's x₆ was:   "mat" → next token "." (conditioned on "mat")
But now we have:  "rug" → we can't use "." anymore!

The token after “mat” might have been “.” with high probability, but the token after “rug” might be “was” or something entirely different. You must discard everything after the rejection point and let the target model generate the next token fresh.
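
A minimal sketch of that bookkeeping (a hypothetical helper, not tied to any particular framework): walk the drafted tokens in order, stop at the first rejection, and discard everything after it.

def accepted_prefix(drafts, accept_flags):
    """Return the usable prefix of drafted tokens: everything before the first rejection.

    drafts:        list of drafted tokens, in order
    accept_flags:  booleans from the accept/reject test at each position
    """
    kept = []
    for token, ok in zip(drafts, accept_flags):
        if not ok:
            break                # rejected here: later drafts were conditioned on a stale context
        kept.append(token)
    return kept

# "mat" rejected at position 5, so the trailing "." (drafted after "mat") is discarded
print(accepted_prefix(["The", "cat", "sat", "on", "the", "mat", "."],
                      [True, True, True, True, True, False, True]))
# -> ['The', 'cat', 'sat', 'on', 'the']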

What p(x) and q(x) Actually Mean

The notation can obscure what’s happening. Let’s be concrete.

$x_1$ is a specific token that was sampled—say, the token “cat” (token ID 9846 in the vocabulary).

$p(x_1)$ is the scalar probability that the target model assigned to that exact token:

# Target model forward pass
logits = target_model(prompt)        # shape: [vocab_size]
probs = softmax(logits)              # shape: [vocab_size]

p_x1 = probs[9846]                   # scalar: 0.073

Similarly, $q(x_1)$ is what the draft model assigned to that same token:

# Draft model forward pass
logits = draft_model(prompt)         # shape: [vocab_size]
probs = softmax(logits)              # shape: [vocab_size]

q_x1 = probs[9846]                   # scalar: 0.051

The acceptance check compares these two scalars:

ratio = p_x1 / q_x1                  # 0.073 / 0.051 = 1.43
acceptance_prob = min(1, ratio)      # min(1, 1.43) = 1.0

u = random.uniform(0, 1)             # say, 0.67
if u < acceptance_prob:              # 0.67 < 1.0 → True
    accept()

The ratio $p(x)/q(x)$ asks: “Did the draft model over- or under-estimate this token?”

Scenario          Ratio   Accept Prob    Meaning
p=0.30, q=0.10    3.0     1.0 (capped)   Draft underestimated—always accept
p=0.10, q=0.10    1.0     1.0            Perfect agreement—always accept
p=0.05, q=0.10    0.5     0.5            Draft overestimated—accept 50%
p=0.01, q=0.10    0.1     0.1            Draft way overconfident—accept 10%

When the draft model is overconfident about a token ($q > p$), you reject proportionally to correct the bias. When the draft is underconfident ($q < p$), you always accept—the residual distribution handles the gap.
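
The table rows are simply $\min(1, p/q)$ evaluated at four (p, q) pairs; a couple of lines reproduce them:

scenarios = [(0.30, 0.10), (0.10, 0.10), (0.05, 0.10), (0.01, 0.10)]
for p_x, q_x in scenarios:
    # ratios above 1 are capped at 1.0; ratios below 1 become the acceptance probability
    print(f"p={p_x:.2f}, q={q_x:.2f} -> accept prob = {min(1.0, p_x / q_x):.1f}")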

Why Resample from Residual, Not Just p?

When you reject a drafted token, you need to pick a new token. The naive answer is: “We want the output to follow $p$, so just sample from $p$.”

This is wrong. Let me show you why with a concrete example.

Two-token vocabulary: A and B

Target:  p(A) = 0.7,  p(B) = 0.3
Draft:   q(A) = 0.4,  q(B) = 0.6

Tracing through the algorithm:

Step 1: Sample from draft $q$

  • 40% chance we draft A
  • 60% chance we draft B

Step 2: Accept/reject check

If we drafted A: $$\text{accept prob} = \min\left(1, \frac{p(A)}{q(A)}\right) = \min\left(1, \frac{0.7}{0.4}\right) = \min(1, 1.75) = 1.0$$

A is always accepted when drafted.

If we drafted B: $$\text{accept prob} = \min\left(1, \frac{p(B)}{q(B)}\right) = \min\left(1, \frac{0.3}{0.6}\right) = \min(1, 0.5) = 0.5$$

B is accepted 50% of the time when drafted.

Calculating the probabilities:

$$P(\text{accept A}) = q(A) \times 1.0 = 0.4$$ $$P(\text{accept B}) = q(B) \times 0.5 = 0.3$$ $$P(\text{reject}) = 1 - 0.4 - 0.3 = 0.3$$

The problem with resampling from p:

If on rejection we resample from $p$:

$$P(\text{output}=A) = P(\text{accept A}) + P(\text{reject}) \times p(A)$$ $$= 0.4 + 0.3 \times 0.7 = 0.4 + 0.21 = 0.61$$

This is wrong—should be 0.7!

$$P(\text{output}=B) = P(\text{accept B}) + P(\text{reject}) \times p(B)$$ $$= 0.3 + 0.3 \times 0.3 = 0.39$$

This is wrong—should be 0.3!

The fix: residual distribution

The residual distribution is:

$$\max(0, p(A) - q(A)) = \max(0, 0.7 - 0.4) = 0.3$$ $$\max(0, p(B) - q(B)) = \max(0, 0.3 - 0.6) = 0.0$$

Normalized: $p'(A) = 1.0$, $p'(B) = 0.0$

Now:

$$P(\text{output}=A) = P(\text{accept A}) + P(\text{reject}) \times p'(A)$$ $$= 0.4 + 0.3 \times 1.0 = 0.7 \checkmark$$

$$P(\text{output}=B) = P(\text{accept B}) + P(\text{reject}) \times p'(B)$$ $$= 0.3 + 0.3 \times 0.0 = 0.3 \checkmark$$
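
Because the vocabulary here has only two tokens, the whole calculation can be verified exactly rather than by sampling; a short sketch with the numbers from this example:

p = {"A": 0.7, "B": 0.3}    # target
q = {"A": 0.4, "B": 0.6}    # draft

accept = {t: q[t] * min(1.0, p[t] / q[t]) for t in p}      # P(draft t and accept it)
p_reject = 1.0 - sum(accept.values())                       # 0.30

residual = {t: max(0.0, p[t] - q[t]) for t in p}
z = sum(residual.values())
residual = {t: r / z for t, r in residual.items()}          # p'(A)=1.0, p'(B)=0.0

for t in p:
    naive = accept[t] + p_reject * p[t]                     # resample from p: wrong
    correct = accept[t] + p_reject * residual[t]            # resample from residual: matches target
    print(t, round(naive, 2), round(correct, 2))
# A 0.61 0.7
# B 0.39 0.3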

The Probability Budget Intuition

Think of it as a budget you need to fill for each token:

Token   Target p(x)   Covered by Accept Phase   Still Needed
A       0.7           min(0.7, 0.4) = 0.4       0.7 - 0.4 = 0.3
B       0.3           min(0.3, 0.6) = 0.3       0.3 - 0.3 = 0.0

The accept phase already “spent” $\min(p,q)$ probability on each token. The residual distribution captures exactly what’s left to fill:

$$p'(x) = \frac{\max(0, p(x) - q(x))}{Z} = \frac{\text{what we still need}}{\text{total rejection probability}}$$

[Interactive figure: The Probability Budget. It shows how rejection sampling fills the exact probability mass for each token in the two-token example above. The residual distribution p′(x) precisely fills the gap left by the accept phase: token B is already “fully funded” by accepts (min(p, q) = p), so it gets zero residual weight, while token A still needs exactly 0.3 of probability mass, which the residual supplies whenever a rejection occurs.]

The Full Algorithm Timeline

Step 1: Draft model runs K times (cheap, fast)
        [x₁] → [x₂] → [x₃] → [x₄] → [x₅]

Step 2: Target model runs ONCE (expensive, but parallel)
        [x₁, x₂, x₃, x₄, x₅] → [p₁, p₂, p₃, p₄, p₅]

Step 3: Sequential verify until rejection
        x₁ ✓ → x₂ ✓ → x₃ ✗ → STOP, discard x₄,x₅
                       ↓
                  resample x₃' ~ residual

Output: [x₁, x₂, x₃']

The key efficiency gain: that single target model forward pass would normally give you just 1 token. With speculation, you potentially get K+1 tokens from the same compute, paying only the small overhead of draft generation.

The Speedup Formula

How much faster does speculative decoding make inference? The expected number of tokens generated per iteration follows a capped geometric distribution.

If we propose γ tokens and each has acceptance probability α, the expected number of accepted tokens is:

$$E[\text{tokens per iteration}] = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}$$

For large γ, this approaches $\frac{1}{1-\alpha}$. With typical values (α = 0.75, γ = 5), we get roughly 3.3 tokens per expensive target model call.

The speedup formula must account for the cost of the draft model:

$$\text{Speedup} = \frac{1 - \alpha^{\gamma+1}}{(1-\alpha)(\gamma c + 1)}$$

where $c = t_{\text{draft}}/t_{\text{target}}$ is the ratio of draft model latency to target model latency. For a draft model 100× smaller, $c \approx 0.01-0.05$.
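
Plugging the two formulas above into a few lines of Python makes the trade-off easy to explore (the values of α, γ, and c below are illustrative):

def expected_tokens(alpha, gamma):
    """Expected tokens per target pass: (1 - alpha**(gamma+1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha, gamma, c):
    """Walltime improvement over plain autoregressive decoding."""
    return expected_tokens(alpha, gamma) / (gamma * c + 1)

print(expected_tokens(0.75, 5))        # ~3.3 tokens per target forward pass
print(speedup(0.75, 5, 0.05))          # ~2.6× with a cheap draft model
print(speedup(0.75, 5, 0.30))          # ~1.3× if the draft model is too expensive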

When $\alpha > c$, speedup is guaranteed: even at $\gamma = 1$ the expected improvement is $(1 + \alpha)/(1 + c) > 1$. Real-world benchmarks on H200 GPUs show Llama 3.1 405B with a Llama 3.2 3B draft achieving 3.6× speedup (33 → 120 tokens/sec).

The Variant Landscape

The field has evolved rapidly since 2023, with researchers finding increasingly clever ways to eliminate draft models or improve acceptance rates.

EAGLE: Feature-Level Speculation

EAGLE (ICML 2024) introduced feature-level speculation, predicting at the second-to-top layer rather than token level. The key insight: autoregression over continuous hidden states is easier than over discrete tokens.

Rather than training a separate small model, EAGLE trains a lightweight head (~1B parameters for 70B models) that extrapolates feature vectors. These features are then decoded to tokens and verified. The approach achieves 3× speedup, 1.6× faster than Medusa’s parallel heads approach.

EAGLE-2 added context-aware dynamic draft trees, adjusting speculation aggressiveness based on prediction confidence to reach 4.26× speedup.

Medusa: Parallel Prediction Heads

Medusa takes a different approach: add multiple single-layer prediction heads directly atop the frozen base model. Each head predicts a different future position independently.

Hidden State → Head 1 → Token +1
            → Head 2 → Token +2
            → Head 3 → Token +3

The Cartesian product of top-k predictions from each head creates candidate continuations verified via tree attention. Training requires only hours on a single A100.
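
To make the Cartesian-product construction concrete, here is a minimal sketch of just the candidate-enumeration step (the head outputs are made up; real Medusa additionally builds a tree attention mask so all candidates are verified in one pass):

from itertools import product

# Hypothetical top-2 predictions from three Medusa heads (token strings for readability)
head_topk = [
    ["the", "a"],          # head 1: position t+1
    ["cat", "dog"],        # head 2: position t+2
    ["sat", "ran"],        # head 3: position t+3
]

# Cartesian product: every combination becomes a candidate continuation to verify
candidates = [list(c) for c in product(*head_topk)]
print(len(candidates))     # 2 * 2 * 2 = 8 candidates
print(candidates[0])       # ['the', 'cat', 'sat']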

The trade-off: position-independent heads can’t condition on earlier speculated tokens, limiting acceptance rates compared to EAGLE’s sequential feature prediction.

Self-Speculative Methods

LayerSkip (ACL 2024) eliminates external drafters entirely by using early exits from the target model itself. During training, layer dropout with increasing rates toward later layers plus early exit loss creates a model that can draft from shallow layers and verify with deep layers.

The catch: requires special training recipes. Baseline LLMs show no speedup with this approach.

Method        Draft Model          Training   Memory Overhead   Speedup    Distribution Preserved
Standard SD   Yes (separate)       Optional   High              1.5-2.5×   Yes
EAGLE-2       Lightweight head     ~2 days    Low-Medium        3-4.3×     Yes
Medusa        No (heads on base)   Hours      Low               2.2-3.6×   Optional

GLM-4.7: Native Multi-Token Prediction

GLM-4.7 represents a paradigm shift: rather than retrofitting speculative decoding onto existing models, Zhipu AI built Multi-Token Prediction directly into the architecture.

The model contains 355 billion total parameters with 32 billion active per forward pass via Mixture-of-Experts routing. This extreme sparsity (only about 9% of parameters active per token) creates an ideal scenario for speculative decoding: massive memory reads but relatively modest compute.

The MTP Architecture

Traditional speculative decoding uses separate draft and target models. GLM-4.7’s MTP adds auxiliary prediction heads within the model itself:

Hidden State h_t → Main Head → P(x_{t+1} | h_t)     [standard next-token]
               → MTP Head  → P(x_{t+2} | h_t)     [speculative token]
               → MTP Head  → P(x_{t+3} | h_t)     [speculative token]

The MTP heads are lightweight projections sharing the same massive 32B-active backbone. This keeps the draft distribution tightly aligned with the target distribution, since both share the same semantic representation, resulting in acceptance rates exceeding 90% with a single speculative token.
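
As a rough structural sketch of what such a head can look like (hypothetical layer choices, not GLM-4.7’s actual implementation; the hidden size comes from the spec below, the vocabulary size is a placeholder), an MTP head is little more than an extra projection over the shared hidden state:

import torch
import torch.nn as nn

class MTPHead(nn.Module):
    """Toy MTP head: a light projection on top of the shared backbone hidden state."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size, bias=False)    # cheap adapter
        self.norm = nn.LayerNorm(hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)  # could be tied to the main head

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # h_t: [batch, hidden] hidden state produced once by the expensive shared backbone
        return self.lm_head(self.norm(self.proj(h_t)))                 # draft logits for x_{t+2}

# The main head predicts x_{t+1} from h_t; this extra head drafts x_{t+2} from the same state.
head = MTPHead(hidden_size=5120, vocab_size=151_000)   # placeholder vocabulary size
draft_logits = head(torch.randn(2, 5120))              # [batch=2, vocab]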

Training the MTP Heads

The MTP layer was trained with loss weight λ = 0.3 for the first 15 trillion tokens, reduced to 0.1 later. This balances multi-token prediction quality against primary language modeling capability.

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \lambda \cdot \mathcal{L}_{\text{MTP}}$$

The reduced weight in later training prevents the MTP objective from interfering with the model’s core capabilities while still maintaining high acceptance rates at inference time.
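
In training-code terms the combined objective is just a weighted sum; a minimal sketch, assuming both terms are ordinary cross-entropy losses computed elsewhere and using the schedule described above:

def combined_loss(lm_loss, mtp_loss, tokens_seen, boundary=15e12):
    """L_total = L_LM + λ · L_MTP, with λ = 0.3 for the first 15T training tokens, then 0.1."""
    lam = 0.3 if tokens_seen < boundary else 0.1
    return lm_loss + lam * mtp_loss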

Architectural Innovations

GLM-4.7 incorporates several architectural choices that complement its MTP capability:

  • Sigmoid-gated loss-free balance routing across ~160 experts (128 active per token)
  • 96 attention heads for 5120 hidden dimension (2.5× more heads than typical)
  • Grouped-Query Attention with partial RoPE at 1M base frequency for 200K context
  • QK-Norm for stabilized attention logits

The increased head count particularly improves reasoning benchmarks despite not improving training loss—an interesting finding suggesting that inference-time compute distribution matters.

vLLM Implementation: PagedAttention Meets Speculation

vLLM’s speculative decoding architecture consists of three phases orchestrated by the SpecDecodeWorker:

  1. Draft Runner: Proposes candidate tokens using MTP heads
  2. Target Runner: Scores all candidates in a single forward pass
  3. Rejection Sampler: Implements accept/reject logic

PagedAttention Integration

The integration with PagedAttention required non-trivial modifications. The memory manager tracks KV cache for both draft and target phases with block-level management enabling sharing, copying, and forking between sequences.

For MTP-style speculation, the draft phase reuses the target model’s KV cache infrastructure, minimizing overhead. The scheduler now supports “preallocated slots”—reserving KV block space sufficient for multiple tokens before the next scheduler invocation.

# GLM-4.7 with native MTP speculative decoding
vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
    --tool-call-parser glm47

Why num_speculative_tokens=1?

The recommendation of 1 speculative token reflects empirical findings: higher values increase mean acceptance length but decrease acceptance rate, reducing overall throughput. The sweet spot maximizes expected tokens per iteration accounting for verification overhead.

With GLM-4.7’s 90%+ acceptance rate at num_speculative_tokens=1, you get close to 2 tokens per target forward pass on average. Increasing to 2 speculative tokens might yield an average of 2.5 tokens but with higher variance and occasional costly rejections.
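
Using the capped-geometric expectation from the speedup section, the trade-off is easy to quantify (the acceptance rates below are illustrative):

def expected_tokens(alpha, k):
    # expected emitted tokens per target pass with k speculative tokens
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(expected_tokens(0.90, 1))   # ~1.9 tokens per pass at 90% acceptance
print(expected_tokens(0.80, 2))   # ~2.4 tokens per pass if acceptance drops to 80%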

Continuous Batching Challenges

Continuous batching with speculation creates the “ragged tensor problem”: different sequences accept different numbers of tokens per iteration, creating irregular batch shapes. At higher concurrency, this overhead consumes up to 40% of computation.

vLLM addresses this through dynamic speculation length adjustment based on system load—reducing speculation aggressiveness when batch sizes grow.
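
Conceptually, such a policy can be as simple as a lookup keyed on the current batch size; the thresholds below are purely illustrative and are not vLLM’s actual heuristic:

def speculation_length(batch_size: int, max_gamma: int = 4) -> int:
    """Illustrative dynamic-gamma policy: speculate less as the batch grows."""
    if batch_size <= 4:
        return max_gamma          # plenty of idle compute: speculate aggressively
    if batch_size <= 16:
        return 2
    if batch_size <= 32:
        return 1
    return 0                      # compute-saturated: fall back to plain decoding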

When Speculation Helps and When It Hurts

The fundamental principle: speculative decoding trades compute for memory bandwidth. When GPUs are memory-bound (most inference scenarios), spare compute cycles can profitably run draft verification. When GPUs are compute-saturated, speculation adds overhead without benefit.

Batch Size Dominates

Condition                        Impact                         Recommendation
Batch size ≤ 8                   Strong benefit (1.5-2.7×)      Enable with γ=4-8
Batch size > 32, short context   Potential slowdown             Disable or use dynamic γ
Batch size > 32, long context    Moderate benefit (up to 2×)    Enable with small γ
QPS < 10                         Strong benefit                 Enable
QPS > 50                         Diminishing/negative returns   Dynamic speculation
Acceptance rate < 0.5            Marginal benefit               Improve draft alignment
At batch size 1, GPUs run severely underutilized—speculative decoding achieves 2.73× speedup (63% latency reduction). Beyond batch size 16-32, benefits diminish and can reverse, causing 1.4-1.8× slowdown.

The Long Context Exception

MagicDec research found that at large batch sizes with long contexts, decoding becomes memory-bound again due to KV cache loading. Speculative decoding can provide 2× speedup even on 8 A100s with high concurrency when context lengths exceed 32K tokens.

INT4/INT8 quantization presents tradeoffs: aggressive weight quantization can reduce acceptance rates as draft model quality degrades. The QSpec approach uses W4A4 for drafting and W4A16 for verification, capturing benefits of both.

Where the Field Is Heading

The success of GLM-4.7’s native MTP suggests future models will ship with speculation built-in rather than bolted-on. Several trends are emerging:

Architectural Integration: Models trained with MTP objectives from the start achieve dramatically higher acceptance rates than retrofitted solutions. Expect this to become standard practice.

Dynamic Speculation: Rather than fixed speculation lengths, future systems will adjust aggressiveness based on:

  • Current batch size
  • Observed acceptance rates
  • Prediction entropy
  • Available compute headroom

Hardware Co-design: As speculative decoding becomes ubiquitous, GPU architectures may evolve to better support the draft-verify pattern with dedicated acceleration for the rejection sampling kernel.

Beyond Token Prediction: EAGLE’s feature-level speculation hints at richer speculation targets. Predicting structured outputs (tool calls, code blocks) could enable even higher acceptance rates for specialized workloads.

Conclusion

Speculative decoding achieves something rare in optimization: meaningful speedups without any quality tradeoff. The output distribution is mathematically identical to standard decoding.

The technique works because LLM inference is memory-bound, not compute-bound. By using idle GPU cycles to verify multiple speculative tokens in parallel, we amortize the expensive memory reads across several output tokens.

GLM-4.7’s native MTP architecture points toward where the field is heading: models designed from the ground up for efficient speculation, achieving 90%+ acceptance rates that make speculative decoding nearly as reliable as a lookup table.

References

  1. Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast Inference from Transformers via Speculative Decoding. International Conference on Machine Learning.

    • The original Google paper introducing speculative decoding with rigorous distribution preservation proofs.
  2. Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. International Conference on Machine Learning.

    • Feature-level speculation achieving superior speedups through hidden state prediction.
  3. Cai, T., Li, Y., Geng, Z., Peng, H., & Dao, T. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. arXiv preprint.

    • Parallel prediction heads approach requiring minimal training overhead.
  4. Elhoushi, M., Shrivastava, A., Liskovich, D., Hosmer, B., Wasti, B., Lai, L., … & Acun, B. (2024). LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding. Association for Computational Linguistics.

    • Self-speculative decoding using early exits from the target model.
  5. Fu, Y., Bailis, P., Stoica, I., & Zhang, H. (2024). Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. International Conference on Machine Learning.

    • Training-free speculative decoding via Jacobi iteration.
  6. Zhipu AI. (2025). GLM-4.7: Advancing the Coding Capability. Hugging Face Model Card.

    • Technical documentation for GLM-4.7’s native MTP architecture.
  7. vLLM Team. (2025). Speculative Decoding Documentation. vLLM Documentation.

    • Implementation details for speculative decoding in vLLM.
  8. Chen, Z., Yang, X., Lin, J., Sun, C., Huang, J., & Chang, K. W. (2024). MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding. arXiv preprint.

    • Analysis of speculative decoding performance at high batch sizes with long contexts.