The Iceberg Problem

If you’ve followed LLM infrastructure over the past two years, you’ve probably heard the greatest hits: PagedAttention eliminates memory fragmentation, continuous batching keeps GPUs busy, and FlashAttention cuts memory from O(N²) to O(N). These optimizations are real and important. They are not the full story.

Below the waterline sits a stack of specialized libraries that most engineers never encounter directly. CUTLASS generates the fused kernels that make quantization practical. Triton lets researchers write GPU code without drowning in thread indexing. FlashInfer handles the messy reality of serving workloads that FlashAttention wasn’t designed for. And NCCL quietly orchestrates communication when models span multiple GPUs.

This post dives into that hidden layer. We’ll trace the path from silicon to scheduler, examining the libraries that transform NVIDIA’s hardware capabilities into the fast inference you actually experience. If you’re deploying LLMs at scale, or simply curious about what happens beneath vLLM’s Python API, this is the stack worth understanding.

The Hardware Contract

Every optimization in this stack exists because of a single physical constraint: the memory wall. Modern GPUs have a dramatic imbalance between compute capability and memory bandwidth.

Consider the H100. Its Tensor Cores can deliver roughly 2,000 TFLOPS of FP8 compute. Its HBM3 memory provides 3.35 TB/s of bandwidth. Simple division gives us a “ridge point” of about 600 ops/byte—if your workload performs fewer than 600 operations per byte loaded from memory, you’re memory-bound. Your expensive Tensor Cores sit idle, waiting for data.

LLM inference during the decode phase operates at roughly 0.5-1 ops/byte. For every token generated, the model loads billions of weight parameters, multiplies them by a single vector, and discards the weights. It’s not even close to compute-bound. This is why a $30,000 GPU often achieves single-digit percentage utilization during autoregressive generation.
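
The arithmetic is worth doing once. A quick back-of-the-envelope check in Python, using the peak figures quoted above (the decode estimate assumes batch size 1 and FP16 weights):

# Roofline arithmetic for an H100, using the peak figures quoted above.
peak_flops = 2_000e12            # ~2,000 TFLOPS (FP8 Tensor Cores)
hbm_bandwidth = 3.35e12          # 3.35 TB/s HBM3

ridge_point = peak_flops / hbm_bandwidth
print(f"ridge point: ~{ridge_point:.0f} ops/byte")            # ~600

# Decode-phase arithmetic intensity for a 70B-parameter model, batch size 1:
params = 70e9
flops_per_token = 2 * params      # one multiply-add per weight
bytes_per_token = 2 * params      # every FP16 weight streamed from HBM
print(f"decode: ~{flops_per_token / bytes_per_token:.0f} op/byte")   # ~1: deeply memory-bound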

To understand why, it helps to see what we’re working with.

[Interactive diagram: NVIDIA H100 GPU architecture, the hardware the software must optimize for. The die holds 132 streaming multiprocessors (SMs) behind a shared 50 MB L2 cache and 80 GB of HBM3. Each SM is an independent processor with 128 CUDA cores, 4 Tensor Cores for matrix operations, a 256 KB register file, and 228 KB of shared memory (SRAM at ~19 TB/s); data that stays on-chip in SRAM and registers is what FlashAttention-style tiling exploits.]
The memory hierarchy offers a path forward:

Level | Capacity | Bandwidth
HBM (Global Memory) | 80 GB | 3.35 TB/s
L2 Cache | 50 MB | ~12 TB/s
SRAM (Shared Memory) | 228 KB/SM | ~19 TB/s
Register File | 256 KB/SM | Highest

The software stack’s job is to maximize data reuse in faster memory levels and minimize trips to slow HBM.
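
The same constraint sets a hard floor on decode latency: every generated token has to stream the full weight set out of HBM at least once. A rough bound, assuming an 8B-parameter model in FP16 that fits on a single H100 (and ignoring the KV cache):

# Bandwidth-only lower bound on per-token decode latency.
params = 8e9                    # 8B-parameter model (fits on one H100)
bytes_per_param = 2             # FP16
hbm_bandwidth = 3.35e12         # bytes/s

weight_bytes = params * bytes_per_param           # 16 GB streamed per token
floor = weight_bytes / hbm_bandwidth              # ~4.8 ms
print(f"~{floor * 1e3:.1f} ms/token, <= ~{1 / floor:.0f} tokens/s per sequence")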

[Interactive diagram: GPU memory hierarchy, the bandwidth wall. Data flows through progressively faster, smaller levels (HBM3 to L2 to shared-memory SRAM to registers) on its way to the Tensor Cores; the roughly 6x bandwidth gap between HBM and on-chip SRAM is why FlashAttention exists, since keeping tiles in SRAM avoids the bottleneck. Hopper adds two hardware assists: TMA (Tensor Memory Accelerator), which offloads address calculation so software only describes the tensor shape while hardware handles async loads, and WGMMA, a direct SRAM-to-Tensor-Core path that bypasses registers, enabling larger tiles and deeper pipelines.]

CUTLASS: The Template Metaprogramming Foundation

When you call a matrix multiplication in PyTorch, it eventually reaches cuBLAS—NVIDIA’s battle-tested linear algebra library. cuBLAS is fast, but it’s a black box. You get the GEMM you’re given.

For LLM inference, that’s often not enough. Consider what happens when you want to run an INT4 quantized model. The weights are stored as packed 4-bit integers. Before the Tensor Cores can process them, you need to:

  1. Load 128-bit vectors containing packed INT4 weights
  2. Unpack each 32-bit integer into eight 4-bit values
  3. Convert to FP16
  4. Apply quantization scales
  5. Feed the result to the Tensor Core

If each step is a separate kernel, you’re writing intermediate results to HBM between operations—exactly the memory traffic you’re trying to avoid. What you need is a single fused kernel that does everything in registers.
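
To make the data movement concrete, here is the unpack-and-dequantize logic written out in plain PyTorch (a reference sketch only, not a fused kernel; the symmetric scheme with a zero point of 8 is an assumption, and a real CUTLASS kernel keeps all of this in registers):

import torch

def dequantize_int4(packed: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """packed: int32 tensor, each element holding eight 4-bit weights.
    scales: FP16 quantization scales, broadcastable to the unpacked shape."""
    shifts = torch.arange(0, 32, 4, device=packed.device)        # 8 nibble positions
    nibbles = (packed.unsqueeze(-1) >> shifts) & 0xF             # [..., 8] values in 0..15
    weights = (nibbles.to(torch.float16) - 8.0) * scales         # assumed zero point of 8
    return weights.flatten(-2)                                   # eight FP16 weights per int32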

This is what CUTLASS enables. It’s NVIDIA’s header-only C++ template library for linear algebra, and it’s the foundation beneath vLLM’s quantization kernels, FlashAttention-3, and most high-performance transformer implementations.

When cuBLAS Won’t Cut It

Use CUTLASS when you need:

  • Custom fusions: Bias + activation + quantization in one kernel
  • Specific precision combinations: FP8 weights with FP16 accumulation
  • Binary size constraints: cuBLAS ships megabytes of kernels for all cases

The trade-off is complexity. CUTLASS kernels require understanding GPU architecture at a level most ML engineers never encounter. But for the performance-critical paths in inference—attention, FFN, quantized projections—that complexity pays dividends.

Triton: GPU Programming Without the Pain

CUTLASS offers maximum control, but its learning curve is steep. Writing CUDA C++ means managing thread indices, avoiding bank conflicts, ensuring coalesced memory access, and reasoning about warp-level synchronization. A single misplaced __syncthreads() can introduce subtle bugs. A suboptimal memory access pattern can halve performance.

Triton takes a different approach. Developed by OpenAI and now integral to PyTorch 2.0, it raises the abstraction level from threads to blocks.

The Mental Model Shift

Traditional CUDA asks: “I am thread 47. What should I do?”

Triton asks: “I am processing this block of data. What operations should happen?”

Consider loading data from memory. In CUDA, you calculate addresses, handle boundary conditions, and coordinate across threads for coalescing. In Triton:

import triton
import triton.language as tl

@triton.jit
def kernel(x_ptr, output_ptr, N, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N
    x = tl.load(x_ptr + offsets, mask=mask)
    result = x * 2.0  # stand-in for the real elementwise computation
    tl.store(output_ptr + offsets, result, mask=mask)

The tl.load call handles coalescing and vectorization automatically. The compiler figures out the optimal memory access pattern. No manual thread indexing or bank conflict avoidance.
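
Launching the kernel is equally compact. A minimal sketch, assuming the kernel above (where the doubling is just a placeholder computation):

import torch

N = 4096
x = torch.randn(N, device="cuda")
out = torch.empty_like(x)

BLOCK_SIZE = 1024
grid = (triton.cdiv(N, BLOCK_SIZE),)      # one program instance per block of elements
kernel[grid](x, out, N, BLOCK_SIZE=BLOCK_SIZE)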

The PyTorch Connection

When you call torch.compile() on a model, TorchInductor generates Triton kernels for GPU execution. The fusion engine identifies sequences of pointwise operations (add, multiply, activation) that can be combined into single kernels. Instead of three separate kernels with intermediate HBM writes, you get one kernel that loads data once, performs all operations in registers, and stores once.

A fused LayerNorm + Linear that would require 500+ lines of optimized CUDA takes about 50 lines of Triton. The resulting kernel won’t match a hand-tuned CUTLASS implementation, but it’ll be close, and it takes hours to write instead of weeks.
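
A minimal illustration of that fusion path (not vLLM's code; the function and shapes are arbitrary):

import torch
import torch.nn.functional as F

def pointwise_chain(x, bias):
    # Three pointwise ops. In eager mode each launches its own kernel with
    # intermediate writes to HBM; under torch.compile, TorchInductor fuses
    # them into a single generated Triton kernel.
    return F.gelu(x + bias) * 0.5

compiled = torch.compile(pointwise_chain)

x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
bias = torch.randn(4096, device="cuda", dtype=torch.float16)
y = compiled(x, bias)   # first call compiles; later calls reuse the fused kernel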

FlashInfer: Built for Serving

FlashAttention changed attention computation by recognizing that the bottleneck was memory I/O, not FLOPs. By computing attention tile-by-tile in SRAM and never materializing the N×N attention matrix in HBM, it reduced memory access from O(N²) to O(N). This brought longer context lengths and faster training.

But FlashAttention was designed for training workloads with regular, rectangular batches. Production serving is messier.

The Serving Reality

In a real serving deployment:

  • Requests arrive with different context lengths (no neat rectangular batches)
  • The KV cache uses PagedAttention with non-contiguous memory blocks
  • Multiple requests share common prefixes (system prompts, document context)
  • CUDA graphs need static shapes, but batch composition changes every iteration

FlashAttention handles none of this natively. FlashInfer does.

What FlashInfer Adds

Block-sparse KV cache support: FlashInfer kernels operate on PagedAttention’s block-sparse representation directly. Page tables map logical token indices to physical memory blocks, and FlashInfer traverses them efficiently without requiring contiguous memory.

Ragged tensor layouts: Standard kernels assume rectangular batches, padding shorter sequences to match the longest. FlashInfer operates on “ragged” layouts where sequences are packed tightly. No wasted compute on padding tokens.
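
The bookkeeping behind a ragged layout is just a CSR-style offsets array. A minimal sketch in plain PyTorch (not FlashInfer's API; names are illustrative):

import torch

# Four requests of different lengths, concatenated with no padding.
seq_lens = torch.tensor([8, 3, 5, 2])
indptr = torch.cat([torch.zeros(1, dtype=torch.long), seq_lens.cumsum(0)])
# indptr = [0, 8, 11, 16, 18]: request i owns packed rows indptr[i]:indptr[i+1]

hidden_dim = 4096
packed = torch.randn(int(seq_lens.sum()), hidden_dim)   # [18, hidden_dim], zero waste

# Slice request 2 back out of the packed buffer.
start, end = indptr[2].item(), indptr[3].item()
req2 = packed[start:end]                                 # shape [5, hidden_dim]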

Plan/run separation: FlashInfer separates scheduling decisions from kernel execution. The “plan” phase precomputes work distribution based on current batch composition. The “run” phase executes with that plan. This separation enables CUDA graph capture—record the run phase once, replay it with different inputs.

Cascade attention: When multiple requests share a common prefix (a system prompt, a retrieved document), naive approaches recompute attention over that prefix for every request. FlashInfer’s cascade attention processes the shared prefix once, caches the result, and computes only the unique suffix per request. For a 32K shared prefix across 256 requests, this yields a 31x speedup.
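
Back-of-the-envelope, counting the KV positions attention must read (the per-request suffix length here is an assumption for illustration; the 31x figure comes from FlashInfer's published benchmarks and depends on the exact setup):

# Shared-prefix attention cost, counted in KV tokens attended over.
prefix_len = 32_768          # 32K shared system prompt
num_requests = 256
suffix_len = 1_024           # assumed unique tokens per request

naive = num_requests * (prefix_len + suffix_len)     # every request re-reads the prefix
cascade = prefix_len + num_requests * suffix_len     # prefix handled once, suffixes per request
print(f"~{naive / cascade:.0f}x less KV traffic")    # ~29x under these assumptions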

Integration with vLLM

vLLM’s attention backend isn’t monolithic. A kernel selection layer examines the workload (hardware architecture, head dimension, precision, model type) and dispatches to the appropriate backend: FlashAttention for standard cases, FlashInfer for PagedAttention scenarios, Triton for specific configurations. This flexibility means you get optimized kernels for your actual workload, not a one-size-fits-all solution.

[Interactive diagram: FlashAttention vs FlashInfer. FlashAttention optimized training; FlashInfer optimizes the messy reality of production serving. In the example batch of 4 requests (18 real tokens), a padded rectangular layout wastes 44% of its slots on padding while FlashInfer's ragged packed layout wastes none, tracking sequence boundaries with offset arrays. For requests sharing a 32K-token system prompt, the naive approach attends over the prefix per request (~129K tokens for 4 requests) while cascade attention caches the prefix once and attends over ~33K, reaching a 31x speedup at 256 requests. Block-sparse KV cache support, ragged layouts, and plan/run separation keep the kernels CUDA-graph compatible.]

NCCL: The Invisible Communication Backbone

Everything discussed so far assumes the model fits on a single GPU. For frontier models, it doesn’t. Llama-70B requires roughly 140GB in FP16—nearly two H100s worth of memory. Larger models require more.

Tensor parallelism splits the model across GPUs within a server. Weight matrices are sharded so each GPU holds a slice. Each GPU computes a partial result, and then… they have to talk to each other.

This is NCCL’s domain.

The Communication Pattern

Tensor parallelism using the Megatron-LM algorithm requires two AllReduce operations per transformer layer:

  1. After the attention output projection: each GPU has computed attention over its subset of heads; AllReduce combines the results.
  2. After the FFN down projection: each GPU has computed a partial FFN result; AllReduce sums the partials.

AllReduce means “sum tensors across all GPUs and distribute the result to all GPUs.” For Llama-70B on 4 GPUs, each AllReduce moves batch_size × sequence_length × hidden_dim × bytes_per_element bytes—and it happens 160 times per forward pass (2 per layer × 80 layers).
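
Plugging in decode-time numbers (hidden size 8,192 for Llama-70B; batch size 1 and FP16 activations are illustrative assumptions):

# Payload of one tensor-parallel AllReduce during decode.
batch_size = 1
seq_len = 1                      # decode: one new token per sequence per step
hidden_dim = 8192                # Llama-70B
bytes_per_element = 2            # FP16 activations

payload = batch_size * seq_len * hidden_dim * bytes_per_element    # 16 KB
per_forward = payload * 2 * 80                                     # 160 AllReduces -> ~2.6 MB/token

# At these sizes the cost is dominated by per-operation latency and
# synchronization, not raw link bandwidth; larger batches scale the
# payload linearly and shift the balance back toward bandwidth.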

The Interconnect Gap

The choice of interconnect dominates multi-GPU inference performance:

Interconnect | Bandwidth
NVLink 4.0 | 900 GB/s bidirectional
PCIe Gen5 | 128 GB/s bidirectional

That’s a 7x gap. On NVLink, tensor parallelism adds modest overhead. On PCIe, communication becomes the bottleneck rather than memory bandwidth.

Even optimized, communication overhead consumes 20-35% of inference time for Llama-70B on 4×H100. It’s the reason single-GPU inference (when the model fits) is always preferable, and why quantization to fit larger models on fewer GPUs often improves overall throughput despite the precision loss.

[Interactive diagram: NCCL AllReduce patterns, showing how NVIDIA's collective communication library orchestrates multi-GPU data synchronization for tensor parallelism.]

Algorithm | Latency | Bandwidth | Steps | Best for
Ring AllReduce | O(k) | Optimal | 2(k-1) | Large messages (>1 MB): prefill, gradient sync
Tree AllReduce | O(log k) | Sub-optimal | 2 log(k) | Small messages (~30 KB): decode, latency-critical paths

NCCL selects the algorithm automatically based on message size and GPU topology; modern NCCL uses hybrid schemes, preferring tree for small messages (under ~256 KB) and switching to ring for larger transfers. For Llama-70B on 4×H100, that works out to 160 AllReduce operations per forward pass at roughly 30 KB each during decode, and on PCIe instead of NVLink, communication overhead can exceed 50%.

Putting It Together

During decode, a single token flows through the entire stack: vLLM schedules the batch, PyTorch dispatches through CUDA graphs, and each transformer layer executes CUTLASS GEMMs for projections (with fused quantization), FlashInfer kernels for attention over the paged KV cache, and NCCL AllReduces if using tensor parallelism.

The time breakdown tells the story:

  • Attention kernels: 40-60%
  • FFN/MLP kernels: 30-40%
  • Communication (with TP): 20-35%
  • Everything else: <10%

Attention and FFN dominate. Both are memory-bound.

The Memory Bandwidth Endgame

Every library in this stack attacks the same fundamental constraint: memory bandwidth. CUTLASS enables fused kernels that minimize HBM round-trips. Triton makes writing such kernels accessible. FlashInfer optimizes attention’s memory access patterns. NCCL minimizes communication overhead that competes for the same memory bandwidth.

The hardware is evolving in the same direction. NVIDIA’s Blackwell B200 delivers 8 TB/s of HBM bandwidth, roughly 2.4x the H100’s, and introduces native FP4 support, halving the bytes per parameter relative to FP8.

Understanding this stack is not just an academic exercise. If you’re deploying LLMs at scale, these libraries determine your cost per token, your latency percentiles, your maximum context length. The optimizations that matter aren’t in the model architecture; they’re in the software that maps that architecture onto silicon.

The iceberg runs deep. Now you know what’s beneath the surface.