
The Hidden Software Stack Behind Fast LLM Inference
Beyond vLLM and PagedAttention: exploring NCCL, CUTLASS, Triton, and FlashInfer, the libraries that actually make LLM inference fast.

How speculative decoding achieves 2-3× inference speedup without changing model outputs, and why GLM-4.7’s native multi-token prediction marks a paradigm shift.
FlashAttention: a memory-efficient, IO-aware exact attention algorithm for transformers.
A deep dive into NVIDIA’s H100 architecture and the monitoring techniques required for production-grade LLM inference optimization.