Introduction: Why We Can’t Afford Full Precision Anymore
The numbers tell a stark story. A model like GPT-3, with its 175 billion parameters, demands 700GB of memory at full precision, enough to consume thousands of dollars in cloud costs every single day. LLaMA-2-70B requires 280GB at full precision and 140GB even with standard FP16, numbers that dwarf the memory capacity of most GPUs. Training such models can cost millions of dollars in compute resources, and even running inference requires an infrastructure of multiple high-end GPUs costing $15,000 each.
These requirements create more than just a financial barrier. They represent a fundamental accessibility problem. Consumer GPUs like the RTX 3090 offer only 24GB of VRAM, while even the newest RTX 5090 provides just 32GB, nowhere near enough for unquantized large models. Mobile and edge devices face even tighter constraints with 4-8GB of total RAM. Without quantization, state-of-the-art models remain locked behind expensive data center infrastructure, inaccessible to researchers, startups, and individual developers.
The computational burden compounds these memory constraints. Matrix multiplication dominates LLM inference time, and high-precision arithmetic is expensive. On the NVIDIA A100, lower-precision tensor cores deliver substantially higher throughput: TF32 peaks at roughly 156 TFLOPS dense (312 with structured sparsity), FP16/BF16 at 312 TFLOPS, and INT8 at 624 TOPS (1,248 with sparsity). Memory bandwidth creates additional bottlenecks: moving 140GB of weights from GPU memory to compute units can take longer than the computation itself, especially for the small batch sizes typical in interactive applications.
This is where quantization transforms from optimization technique into enabling technology. By compressing neural network weights from 32 bits to 4 bits or lower, quantization can achieve 4-8x memory reduction and 2-4x computational speedup while keeping accuracy losses small with proper techniques. Modern quantization methods are transforming LLM deployment from multi-GPU clusters to single consumer GPUs, often reducing total system costs from six figures to a few thousand dollars in certain setups (e.g., 4-bit with offloading or multi-GPU).
The field has evolved dramatically. What began as “can we quantize below 8 bits?” in 2022 has progressed to early deployments and research systems exploring 2-bit models by 2025, with clear paths emerging for both research and real-world applications.
[Placeholder for diagram: GPU memory requirements by model size and precision, comparing memory footprint across quantization formats for LLM deployment]
Part 1: The Foundation - How Computers Store Numbers
To understand how we can compress these models, we must first understand what we’re compressing. Machine learning algorithms don’t process text; they process numbers, and the format used to store these numbers dictates their range, accuracy, and the memory they consume.
The 32-Bit Standard: FP32 and Its Anatomy
The default numerical format in most deep learning frameworks is the 32-bit single-precision floating-point number, commonly known as FP32. Defined by the IEEE 754 standard, an FP32 number occupies 32 bits (4 bytes) of memory, divided into three distinct parts that work together to represent a vast range of real numbers.
The sign bit (1 bit) is straightforward—a value of 0 indicates a positive number, while 1 indicates negative. The exponent (8 bits) determines the magnitude or range of the number, functioning like the exponent in scientific notation by scaling the value up or down by powers of 2. To represent both very large and very small numbers, the exponent is stored as an 8-bit unsigned integer using a technique called exponent bias. For FP32, the bias is 127, meaning the actual exponent equals the stored value minus 127. This allows the 8 bits to represent an exponent range from -126 to +127 without requiring a separate sign bit for the exponent itself.
The mantissa (23 bits), also known as the significand, determines the precision of the number—essentially, how many significant digits it can accurately represent. The mantissa is a binary fraction normalized to be between 1.0 and 2.0. Because the leading digit of a normalized binary number in this format is always 1, this “implied leading 1” doesn’t need storage. This clever trick effectively gives the mantissa 24 bits of precision while only using 23 bits of memory.
[Placeholder for diagram: FP32 bit-level anatomy under IEEE 754, showing the sign, exponent, and mantissa fields and the reconstruction formula value = (−1)^sign × 1.mantissa × 2^(exponent − 127)]
Key insights:
- 32 bits = 4 bytes per number
- Exponent bias (127) enables wide dynamic range without a signed exponent
- Implied leading 1 in mantissa gives 24 effective precision bits
- Range: ±1.4 × 10^-45 to ±3.4 × 10^38
- Precision: ~7 decimal digits
This structure gives FP32 a dynamic range of approximately ±1.4×10⁻⁴⁵ to ±3.4×10³⁸. The epsilon (smallest representable difference) equals 2⁻²³ ≈ 0.00000012. For a 175B parameter model, FP32 representation demands 700GB of memory at 4 bytes per parameter.
However, a key limitation of any finite binary representation is that it cannot perfectly represent all decimal numbers. Just as 1/3 cannot be written with a finite number of decimal digits, values like 0.1 cannot be represented exactly in binary, leading to minor rounding errors in computation.
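A few lines of Python make this concrete. The sketch below (standard library only) unpacks the raw bits of an FP32 value into sign, exponent, and mantissa, and shows that 0.1 picks up a tiny rounding error once stored in binary:

```python
import struct

def fp32_anatomy(x: float):
    """Decompose a number's FP32 encoding into sign, exponent, and mantissa."""
    # Reinterpret the 4 bytes of the single-precision encoding as an unsigned int
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with bias 127
    mantissa = bits & 0x7FFFFF         # 23 bits (implied leading 1 not stored)
    # Reconstruction formula for normalized values only
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
    return sign, exponent - 127, mantissa, value

print(fp32_anatomy(0.1))  # exponent -4, reconstructed value ~0.10000000149
print(f"{struct.unpack('>f', struct.pack('>f', 0.1))[0]:.10f}")  # 0.1000000015, not exactly 0.1
```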
The Half-Precision Compromise: FP16 and BFloat16
While FP32 provides a good balance of range and precision, its 32-bit size is a primary contributor to the massive memory footprint of LLMs. Two 16-bit (half-precision) formats have emerged, each with a different solution to the fundamental tradeoff between range and precision.
FP16 (Half-Precision) compresses the format to 1 sign bit, 5 exponent bits (bias=15), and 10 mantissa bits. This dramatic reduction in exponent range means FP16 only spans ±6×10⁻⁵ to ±65,504. With only 3-4 significant digits and epsilon of 0.00097656, FP16 risks overflow and underflow during model training, where gradients can become extremely small or extremely large, causing the training process to fail.
The memory advantage is substantial: 175B parameters require only 350GB, halving the footprint while enabling 2x faster computation on Tensor Core GPUs. Mixed precision training exploits this by computing in FP16 but accumulating gradients in FP32, though loss scaling is needed to prevent underflow.
BFloat16 (BF16), developed by Google Brain specifically for deep learning, takes a radically different approach. It allocates 1 bit for the sign, 8 bits for the exponent (same as FP32), and just 7 bits for the mantissa. By maintaining FP32’s full dynamic range (±1.2×10⁻³⁸ to ±3.4×10³⁸) while sacrificing precision (epsilon = 0.0078125), BF16 becomes a “drop-in replacement” for FP32 in training.
Converting between FP32 and BF16 is trivial: simply truncate or zero-pad the lower 16 bits, which makes the conversion computationally cheap. The identical exponent range means no overflow issues and no need for loss scaling during training. BF16 has become the preferred format for training large models.
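Because BF16 keeps FP32's exponent layout, the conversion really is just dropping the lower 16 bits. A minimal NumPy sketch of the truncation (real hardware typically rounds to nearest rather than truncating):

```python
import numpy as np

def fp32_to_bf16_truncate(x: np.ndarray) -> np.ndarray:
    """Convert FP32 to BF16-precise values by zeroing the low 16 bits (truncation)."""
    bits = x.astype(np.float32).view(np.uint32)
    bf16_bits = bits & 0xFFFF0000      # keep sign, 8 exponent bits, top 7 mantissa bits
    return bf16_bits.view(np.float32)  # still stored in 32 bits, but BF16-precise

x = np.array([3.14159265, 1e-30, 1e30], dtype=np.float32)
print(fp32_to_bf16_truncate(x))  # ~[3.140625, ~1e-30, ~1e30]: range preserved, precision reduced
```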
The emergence and widespread adoption of BFloat16 reveals a core principle of deep learning systems: system-level stability is often more critical than component-level precision. The industry’s willingness to sacrifice the precision of individual numbers (BF16 has 3 fewer mantissa bits than FP16) for the stability of the entire training process demonstrates that deep neural networks are remarkably robust to a certain level of numerical noise. The most critical failure mode during training is not a slight inaccuracy in a single weight but catastrophic exploding or vanishing gradients.
Integer Quantization: INT8 and Beyond
INT8 (8-bit integer) abandons floating point entirely, representing values as integers from -128 to 127 (signed) or 0 to 255 (unsigned). Quantization maps continuous weights to these 256 discrete values using a scale factor S and zero-point Z through the formula: x_q = round(x/S + Z). Dequantization reverses this: x = S × (x_q - Z).
The scale and zero-point are typically stored in higher precision, adding some overhead. Advanced techniques use per-channel quantization with different (S, Z) pairs for each output channel, dramatically improving accuracy. LLM.int8() goes further, identifying outlier features with magnitude >6 and keeping them in FP16 while quantizing the rest, achieving <0.5% degradation on 176B models.
At 1 byte per parameter, INT8 reduces a 175B model to approximately 175-200GB including overhead—a 75% reduction from FP32. Computational benefits are substantial: INT8 tensor cores deliver 2.3-4x speedup over FP32 in practice, though realizing these gains requires specialized kernels. The challenge lies in selecting appropriate quantization ranges: too narrow loses information through clipping, too wide wastes the limited 256 values on rarely-used extremes. Block-wise quantization divides parameters into groups of 64-128, computing separate scales for each block to limit outlier impact.
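As a rough illustration of block-wise scaling, here is a minimal symmetric absmax sketch with one scale per 64-weight block; production libraries such as bitsandbytes implement related schemes with many refinements (outlier handling, fused kernels, different block sizes):

```python
import numpy as np

def blockwise_int8_quantize(w: np.ndarray, block: int = 64):
    """Symmetric INT8 quantization with one absmax scale per block of weights."""
    flat = w.astype(np.float32).ravel()
    pad = (-len(flat)) % block
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0  # absmax scale per block
    scales[scales == 0] = 1.0                                   # avoid division by zero
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales

def blockwise_int8_dequantize(q, scales, shape):
    deq = (q.astype(np.float32) * scales).ravel()
    return deq[: np.prod(shape)].reshape(shape)

w = np.random.randn(4, 130).astype(np.float32)
q, s = blockwise_int8_quantize(w)
w_hat = blockwise_int8_dequantize(q, s, w.shape)
print("max abs error:", np.abs(w - w_hat).max())  # bounded by half a step of that block's scale
```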
INT4 (4-bit integer) pushes compression to extremes with only 16 representable values. Standard INT4 uses -8 to 7, but neural network weights cluster around zero with an approximately normal distribution. NormalFloat4 (NF4) exploits this by placing quantization points at the quantiles of a standard normal distribution, optimizing for neural network weight distributions rather than uniform spacing. QLoRA uses NF4 with 64-weight blocks and double quantization (quantizing the scale factors themselves to 8-bit), compressing a 65B model from 130GB to roughly 40–50GB with careful overhead management.
The memory savings enable previously impossible deployments: a 70B parameter model can fit on a single 48–64GB GPU using INT4; fitting into 32GB typically requires significant offloading/pruning and tight KV‑cache constraints. However, computation typically dequantizes weights to FP16 for matrix multiplication since native 4-bit arithmetic lacks broad hardware support. This means memory bandwidth benefits dominate over raw computational speedup.
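To see why quantile-based codebooks help at 4 bits, the toy sketch below builds 16 levels from normal-distribution quantiles (via SciPy's norm.ppf) and compares them with uniform INT4-style spacing on Gaussian weights. It is a simplification for illustration, not the exact NF4 construction from the QLoRA paper:

```python
import numpy as np
from scipy.stats import norm

def quantile_codebook_4bit():
    """16 levels placed at evenly spaced quantiles of a standard normal (toy NF4)."""
    probs = (np.arange(16) + 0.5) / 16      # midpoints of 16 equal-probability bins
    levels = norm.ppf(probs)
    return levels / np.abs(levels).max()    # normalize to [-1, 1], as NF4 does

def quantize_to_codebook(w, levels):
    """Map each absmax-normalized weight to its nearest codebook entry."""
    scale = np.abs(w).max()
    idx = np.abs(w[:, None] / scale - levels[None, :]).argmin(axis=1)
    return levels[idx] * scale

w = np.random.randn(10_000).astype(np.float32)
for name, levels in [("uniform INT4-like", np.linspace(-1, 1, 16)),
                     ("quantile (toy NF4)", quantile_codebook_4bit())]:
    err = np.mean((w - quantize_to_codebook(w, levels)) ** 2)
    print(f"{name}: MSE = {err:.5f}")  # quantile codebook gives noticeably lower error on Gaussian weights
```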
Feature | FP32 | FP16 | BFloat16 | INT8 | INT4 |
---|---|---|---|---|---|
Total Bits | 32 | 16 | 16 | 8 | 4 |
Sign Bits | 1 | 1 | 1 | - | - |
Exponent Bits | 8 | 5 | 8 | - | - |
Mantissa Bits | 23 (+1 implied) | 10 (+1 implied) | 7 (+1 implied) | - | - |
Dynamic Range | ±1.4e-45 to ±3.4e38 | ±6e-5 to ±65,504 | ±1.2e-38 to ±3.4e38 | -128 to 127 | -8 to 7 |
Precision | 7 digits | 3-4 digits | 2-3 digits | 256 levels | 16 levels |
[Placeholder for diagram: Memory footprint comparison showing LLaMA-70B across precisions with GPU memory capacity lines]
Part 2: Understanding Quantization - The Accuracy-Efficiency Tradeoff
With a foundation in numerical representation, we can now explore the process of quantization itself. At its heart, quantization is a mapping function from a large, often continuous set of values to a smaller, discrete set.
The Mechanics: Mapping Continuous to Discrete
Quantization converts model parameters from a high-precision data type like FP32 to a low-precision one, most commonly an 8-bit integer (INT8). An INT8 variable can only represent 256 distinct values, a stark contrast to the billions of values representable by FP32. This mapping is achieved through a linear transformation known as the affine quantization scheme.
The core formula relates the original real value (r) to its quantized integer counterpart (q) using two key parameters: a scale (S) and a zero-point (Z):
r = S × (q - Z)
Rearranging for quantization gives: q = round(r/S + Z)
The scale is a positive floating-point number that acts as the step size of the quantization. It defines the ratio of the original floating-point range to the target integer range, calculated as (r_max - r_min) / (q_max - q_min).
The zero-point is an integer within the quantized range that corresponds exactly to the floating-point value 0.0. This is critical because the value zero holds special significance in neural networks—it’s used for padding in convolutions and serves as the threshold for activation functions like ReLU. Ensuring that 0.0 can be perfectly represented without error after quantization is essential for maintaining model accuracy.
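The full round trip fits in a few lines. A minimal sketch of per-tensor asymmetric (affine) INT8 quantization, following the formulas above:

```python
import numpy as np

def affine_quantize(x: np.ndarray, num_bits: int = 8):
    """Asymmetric affine quantization: map [x_min, x_max] onto [q_min, q_max]."""
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # -128..127 for INT8
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (q_max - q_min)                       # step size S
    zero_point = int(round(q_min - x_min / scale))                  # integer q that represents 0.0
    q = np.clip(np.round(x / scale + zero_point), q_min, q_max).astype(np.int8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(1000).astype(np.float32) * 3 + 1                # not centered at zero
q, S, Z = affine_quantize(x)
x_hat = affine_dequantize(q, S, Z)
print("scale:", S, "zero-point:", Z, "max error:", np.abs(x - x_hat).max())  # error on the order of S/2
```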
[Placeholder for diagram: Affine quantization mapping continuous floating-point values into discrete integer buckets, parameterized by the scale S and zero-point Z]
Key insights:
- Linear mapping: r = S × (q − Z) is an affine transformation
- Scale (S): Step size; smaller S = finer granularity
- Zero-point (Z): Ensures r = 0.0 maps to an integer
- Symmetric: Z = 0 (common for weights)
- Asymmetric: Z ≠ 0 (common for activations)
- Error: Rounding introduces ±S/2 quantization error
Symmetric vs. Asymmetric Quantization
The zero-point concept leads to two primary quantization schemes, each with distinct tradeoffs.
Asymmetric (Affine) Quantization is the general form where the zero-point can be any integer in the quantized range. This scheme excels at quantizing data whose distribution is not centered around zero. A prime example is the output of a ReLU activation function, where all values are non-negative. Asymmetric quantization can map the range [0.0, 1000.0] to the full integer range, maximizing the use of available precision.
Symmetric Quantization is a special case where the floating-point range is forced to be symmetric around zero (e.g., [-a, a]). This constraint ensures that floating-point 0.0 maps directly to integer 0, making the zero-point Z = 0. The primary advantage is computational efficiency—since Z = 0, the subtraction operation in the dequantization formula can be skipped, leading to faster execution on some hardware.
However, if the underlying data distribution is skewed (like after a ReLU), symmetric quantization can be wasteful, as half of the quantized range will go unused, effectively losing one bit of precision.
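A quick comparison on ReLU-style (non-negative) data shows the cost of forcing symmetry. The snippet below reuses the hypothetical affine_quantize / affine_dequantize helpers sketched earlier:

```python
import numpy as np

x = np.clip(np.random.randn(100_000).astype(np.float32), 0, None) * 10.0  # ReLU-like: all >= 0

# Symmetric: range forced to [-a, a], zero-point fixed at 0
a = np.abs(x).max()
scale_sym = a / 127.0
q_sym = np.clip(np.round(x / scale_sym), -127, 127).astype(np.int8)
x_sym = q_sym.astype(np.float32) * scale_sym

# Asymmetric: full [x_min, x_max] mapped onto [-128, 127]
q_asym, S, Z = affine_quantize(x)
x_asym = affine_dequantize(q_asym, S, Z)

print("symmetric  MSE:", np.mean((x - x_sym) ** 2))   # negative half of the codes is never used
print("asymmetric MSE:", np.mean((x - x_asym) ** 2))  # roughly 4x lower error: one effective bit recovered
```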
The Quantization Timeline: PTQ vs. QAT
Quantization methods are categorized not just by their mathematical scheme but by when they’re applied in the model’s lifecycle.
Post-Training Quantization (PTQ) applies quantization to a model that has already been fully trained in high precision. The process typically involves passing a small “calibration dataset” (a few hundred representative examples) through the model to observe the ranges of its weights and activations. These observed ranges are then used to calculate the optimal scale and zero-point parameters for each tensor.
The advantages are compelling: PTQ is fast, simple, and computationally inexpensive. It doesn’t require access to the original training pipeline or large datasets, making it highly accessible. However, because the model’s weights were optimized for a high-precision environment, abruptly forcing them into a low-precision format can introduce significant “quantization noise,” leading to noticeable accuracy drops, especially at very low bit-widths (e.g., 4-bit).
Quantization-Aware Training (QAT) simulates the effects of quantization during the training or fine-tuning process. It works by inserting “fake quantization” operations into the model’s computation graph. In the forward pass, weights and activations are quantized and then immediately dequantized back to a floating-point format. This simulates the error that will be introduced during low-precision inference. Crucially, the backward pass computes gradients with respect to the original full-precision weights, allowing the model to learn parameters that are inherently robust to quantization effects.
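In code, fake quantization is usually implemented with a straight-through estimator: quantize-dequantize in the forward pass, but let gradients flow through as if the operation were the identity. A minimal PyTorch-flavored sketch of the idea (not the API of any particular QAT library):

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize in the forward pass; identity gradient in the backward pass."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = w.detach().abs().max() / q_max                  # symmetric per-tensor scale
    w_q = torch.clamp(torch.round(w / scale), -q_max, q_max) * scale
    # Straight-through estimator: forward value is w_q, backward behaves as identity on w
    return w + (w_q - w).detach()

w = torch.randn(4, 4, requires_grad=True)
y = fake_quantize(w).sum()
y.backward()
print(w.grad)  # all ones: gradients flow as if no quantization had happened
```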
QAT almost always achieves higher accuracy than PTQ, often recovering nearly all of the original model’s performance, even at aggressive quantization levels. However, it’s a far more complex and computationally expensive process, requiring retraining or extensive fine-tuning with access to the training dataset and significant computational resources.
The choice between PTQ and QAT presents a fundamental dilemma for LLMs. The models that would benefit most from QAT’s superior accuracy are the very ones for which the method is computationally and financially prohibitive. Fine-tuning a model with hundreds of billions of parameters can require hundreds of gigabytes of GPU memory, making QAT impractical for all but the largest institutions. This has led to PTQ becoming the dominant paradigm for LLM quantization, “not for its superiority but feasibility.”
This critical gap—the need for QAT-level accuracy with PTQ-level efficiency—has been the primary driver behind the intense research and development of advanced PTQ algorithms like GPTQ.
Feature | Post-Training Quantization (PTQ) | Quantization-Aware Training (QAT) |
---|---|---|
Workflow | Quantize a fully trained model | Simulate quantization during training/fine-tuning |
Computational Cost | Low (calibration pass only) | High (requires retraining) |
Data Requirement | Small calibration dataset | Full training dataset |
Typical Accuracy | Good at 8-bit, may degrade at lower bit-widths | Excellent, often near full-precision performance |
Best For | Scenarios with limited resources, no access to training data, or when speed of deployment is critical | Applications where maximizing accuracy is paramount and computational resources are available |
The Accuracy-Efficiency Frontier
The choice of numeric format involves subtle tradeoffs that extend beyond simple memory calculations. Accuracy degradation patterns differ across precisions, model architectures, and quantization methods, with certain failure modes appearing only at extreme compression.
From FP16 to INT8, degradation is often very small when using proper techniques. Reports on BLOOM‑176B show differences within typical measurement noise on many tasks, demonstrating that INT8 can be virtually lossless for large models when outlier features (activation magnitudes above roughly 6) are handled separately in FP16.
INT4 quantization shows noticeable but often acceptable degradation (commonly a few percentage points) depending on method and model. GPTQ frequently keeps 4‑bit perplexity deltas small on WikiText for large models, though exact results vary by setup. Modern techniques like AWQ can approach FP16 performance on certain tasks for specific models/configs. The difference between methods matters significantly—AWQ’s activation‑aware approach outperforms naive rounding by protecting the most activation‑sensitive weights.
Dramatic failure often occurs at 2 bits without specialized methods. Vanilla GPTQ typically fails at 2 bits, while methods like SpQR can make 2‑bit feasible by identifying and isolating a subset of weights as outliers (kept in higher precision) while quantizing the rest to 2 bits.
Model size affects quantization tolerance non-linearly. Larger models generally quantize better because individual weight errors average out across more parameters and layers. A 70B model tolerates 4-bit quantization with 96-99% accuracy recovery, while smaller 7B models show more variability. Counter-intuitively, models trained on more data become harder to quantize—LLaMA 3’s 15 trillion training tokens create more complex weight distributions than earlier models, increasing quantization sensitivity.
Memory savings follow predictable patterns but include important overhead. A useful rule of thumb is Memory = Parameters × Bytes_per_parameter × 1.2, where the 1.2 factor captures typical overhead from scale factors, activation tensors, and KV cache. For LLaMA‑70B: FP16 needs ~140–148GB, INT8 requires ~70–74GB (≈2× compression), while INT4 uses ~35–45GB (≈3.5× compression). The KV cache for attention adds substantial overhead at long context lengths: at 128K tokens an 8B model can consume on the order of tens of GB in FP16, sometimes exceeding the quantized model weights themselves.
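The rule of thumb above is easy to turn into a quick estimator. A back-of-the-envelope sketch, with the 1.2 overhead factor taken from the formula above as an approximation rather than a measurement:

```python
def estimate_memory_gb(params_billion: float, bits_per_param: float, overhead: float = 1.0) -> float:
    """Weights-only estimate; pass overhead=1.2 to roughly include scales, activations, and KV cache."""
    return params_billion * 1e9 * (bits_per_param / 8) * overhead / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    weights = estimate_memory_gb(70, bits)
    total = estimate_memory_gb(70, bits, overhead=1.2)
    print(f"LLaMA-70B @ {name}: ~{weights:.0f} GB weights, ~{total:.0f} GB with typical overhead")
# FP16: ~140 GB weights, ~168 GB with overhead; INT4: ~35 GB weights, ~42 GB with overhead
```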
[Placeholder for diagram: Accuracy-efficiency Pareto frontier showing perplexity degradation vs memory reduction for different quantization methods]
Part 3: GPTQ - When Second-Order Thinking Meets Quantization
GPTQ (Generative Pre-trained Transformer Quantization) represents a breakthrough in post-training quantization. Published at ICLR 2023 by researchers from IST Austria and ETH Zurich, GPTQ enables 3-4 bit compression of 175B parameter models in approximately 4 GPU hours while maintaining negligible performance degradation.
The Core Problem and GPTQ’s Insight
The fundamental challenge with quantization is this: when you quantize a weight, you introduce error. Naively rounding weights to the nearest quantization level (Round-to-Nearest or RTN) performs acceptably at 8 bits but fails catastrophically at 3-4 bits, essentially destroying model capability.
GPTQ asks a different question: how can we quantize weights while compensating for the error by adjusting other weights to maintain the layer’s output?
For each layer, GPTQ solves the optimization problem:
argmin_Ŵ ||WX - ŴX||²
where W is the original weight matrix, Ŵ is the quantized version, and X represents layer inputs from calibration data. This minimizes the squared difference between full-precision and quantized layer outputs rather than focusing on weight values themselves. The key realization: we care about preserving behavior (outputs) not parameter values.
This objective decomposes into independent row-wise problems since the squared Frobenius norm sums across rows. Processing each row separately reduces computational complexity dramatically while remaining theoretically sound.
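The decomposition is easy to verify numerically: the squared Frobenius norm of the output error is just the sum of per-row squared errors, so each row of W can be quantized independently against the same calibration inputs X. A small NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))      # weight matrix: 8 output rows, 16 input features
X = rng.standard_normal((16, 32))     # calibration inputs: 16 features, 32 samples
W_hat = np.round(W * 4) / 4           # a crude "quantized" version of W

total_error = np.linalg.norm(W @ X - W_hat @ X, "fro") ** 2
row_errors = [np.linalg.norm(W[i] @ X - W_hat[i] @ X) ** 2 for i in range(W.shape[0])]
print(np.isclose(total_error, sum(row_errors)))  # True: the objective splits across rows
```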
The Engine: Optimal Brain Quantization and the Hessian
GPTQ’s innovation is built upon Optimal Brain Quantization (OBQ), which itself descends from the classic Optimal Brain Surgeon pruning framework of the 1990s. OBQ provides a principled way to quantize weights by using second-order information to guide the process.
This information is captured in the Hessian matrix H = 2XXᵀ (with a damping term λI added for stability), which contains the second derivatives of the layer’s reconstruction loss with respect to its weights. Intuitively, the Hessian describes the “curvature” of the loss landscape. A sharp curve in a particular direction (a large corresponding value in the Hessian) means the loss is highly sensitive to changes in that weight. Conversely, a flat curve (a small Hessian value) indicates that the weight can be changed with little impact on the loss.
[Placeholder for diagram: Understanding the Hessian (H = 2XXᵀ) through loss curvature, comparing a loss-landscape cross-section with the sensitivities of all weights]
Key insights:
- First-order information (the gradient) gives the direction of steepest descent
- Second-order information (the Hessian) gives the curvature, i.e., how steep the loss is around the current weights
- High curvature: small changes cause large loss increases
- Low curvature: changes have a smaller effect on the loss
- GPTQ leverages H⁻¹ to gauge each weight’s sensitivity and to compensate quantization error
GPTQ uses the inverse of the Hessian matrix, H⁻¹ = (2XXᵀ + λI)⁻¹, where the damping term λ (typically a small fraction of the average diagonal) prevents numerical instability. The OBQ algorithm quantizes weights one by one. At each step, it must decide which weight to quantize next. The optimal choice is the one that will cause the smallest increase in the layer’s output error, guided by the diagonal entries of the inverse Hessian matrix.
The Algorithm: Error Compensation in Action
The core of the OBQ method, and by extension GPTQ, is an iterative process of error compensation within each layer:
- Select & Quantize: A single weight is chosen and quantized (e.g., rounded to the nearest 4-bit representable value)
- Measure Error: The algorithm calculates the error introduced by this rounding step
- Compensate: This is the crucial step. The algorithm updates all the other, not-yet-quantized full-precision weights in the layer to compensate for the error just introduced. This update is not uniform; it is scaled by the inverse Hessian, which directs the correction towards related but less sensitive weights that can absorb the error with minimal impact on the layer’s output
After quantizing weight w_q at position q to its nearest grid point, the quantization error must be compensated. The update formula:
δ = -[(w_q - quant(w_q)) / [H⁻¹]_qq] · (H⁻¹)_:,q
redistributes this error across remaining unquantized weights, minimizing impact on layer output.
This iterative compensation is what makes GPTQ so accurate. It doesn’t just round weights independently; it actively and intelligently corrects for the rounding error at every single step, ensuring the final quantized layer behaves as closely as possible to the original.
After each quantization, the Hessian inverse must be updated by removing the quantized weight’s row and column. Gaussian elimination provides the update, but this accumulates numerical error. GPTQ’s solution uses Cholesky decomposition to precompute all required Hessian information in a numerically stable manner, preventing error accumulation that would otherwise corrupt billion-parameter models.
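Putting the pieces together, the sketch below is a heavily simplified, single-row NumPy version of the procedure: build the damped Hessian from calibration inputs, quantize one weight at a time, compensate the remaining weights using the inverse Hessian, and remove the quantized weight from the problem with a rank-1 update (which real GPTQ handles via a Cholesky factorization). It omits blocking, grouping, and packing, so treat it as an illustration of the update rule rather than a faithful reimplementation:

```python
import numpy as np

def gptq_quantize_row(w: np.ndarray, X: np.ndarray, num_bits: int = 4, damp: float = 0.01):
    """Quantize one weight row w (shape [d]) against calibration inputs X (shape [d, n])."""
    d = w.shape[0]
    H = 2.0 * X @ X.T                                   # layer-wise Hessian
    H += damp * np.mean(np.diag(H)) * np.eye(d)         # damping for numerical stability
    Hinv = np.linalg.inv(H)                             # real GPTQ uses a Cholesky factor instead

    q_max = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / q_max                     # simple symmetric per-row scale
    w = w.astype(np.float64).copy()
    w_q = np.zeros_like(w)

    for j in range(d):                                  # fixed (arbitrary) quantization order
        w_q[j] = np.clip(np.round(w[j] / scale), -q_max, q_max) * scale
        err = (w[j] - w_q[j]) / Hinv[j, j]
        w[j + 1:] -= err * Hinv[j + 1:, j]              # redistribute error to unquantized weights
        Hinv -= np.outer(Hinv[:, j], Hinv[j, :]) / Hinv[j, j]  # drop weight j from the problem
    return w_q

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 256))                      # 64 input features, 256 calibration samples
w = rng.standard_normal(64)
w_q = gptq_quantize_row(w, X)
scale = np.abs(w).max() / 7
rtn = np.clip(np.round(w / scale), -7, 7) * scale       # plain round-to-nearest baseline
print("output error, GPTQ-style:", np.linalg.norm(w @ X - w_q @ X))
print("output error, RTN       :", np.linalg.norm(w @ X - rtn @ X))  # typically noticeably larger
```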
Making It Practical: GPTQ’s Efficiency Optimizations
While powerful, the original OBQ algorithm is far too slow for modern LLMs. Its greedy search for the next-best weight to quantize and the need to update the inverse Hessian after every single weight result in cubic runtime complexity that is prohibitive. The genius of GPTQ lies in three clever optimizations:
Arbitrary Order Quantization: The authors made a critical empirical discovery—for very large, overparameterized models, the specific order in which weights are quantized has minimal impact on final accuracy. GPTQ thus abandons the expensive greedy search of OBQ and instead quantizes weights in a simple, fixed order (e.g., column by column). This not only eliminates the search but also means the Hessian information can be shared across all rows of a weight matrix, dramatically reducing redundant computations.
Lazy Batch Updates: To make the algorithm friendly to modern GPUs, which thrive on parallel computation, the error-compensation updates are batched. Instead of updating the entire remaining matrix after every single weight, GPTQ processes a block of columns (128 is a common block size) at a time and applies the accumulated updates to the rest of the matrix only once the block is finished. This significantly improves the compute-to-memory-access ratio, leading to massive speedups.
Cholesky Decomposition: To ensure the complex matrix inverse operations remain numerically stable and efficient throughout the process, GPTQ employs this standard numerical linear algebra technique.
GPTQ’s success is a testament to brilliant research engineering. It bridges the gap between purely heuristic methods (like simple rounding) and computationally prohibitive, fully principled methods (like QAT). Its true innovation was not inventing new mathematical theory from scratch, but rather identifying the key computational bottlenecks in a powerful existing algorithm (OBQ) and devising pragmatic approximations (fixed order, lazy updates) that were shown to work exceptionally well at the massive scale of modern LLMs.
[Placeholder for diagram: GPTQ block-wise quantization with Hessian-guided error compensation, processing weights column by column and redistributing the quantization error]
Calibration requires surprisingly little data. GPTQ uses 128 random 2048-token segments from C4 (Colossal Clean Crawled Corpus)—approximately 262,144 tokens of generic web text. This zero-shot approach requires no task-specific data, making quantization fast and broadly applicable.
On a single NVIDIA A100 80GB, GPTQ quantizes OPT-175B in 4.2 hours and BLOOM-176B in 3.8 hours. Memory requirements are manageable: GPTQ loads one Transformer block (typically around six linear layers) at a time, accumulates its Hessians, quantizes it, then passes inputs through the quantized block to generate the inputs for the next block.
Part 4: Beyond GPTQ - Alternative Approaches
GPTQ represents one approach among several competing quantization methods, each with distinct strengths:
AWQ (Activation-aware Weight Quantization) has emerged as a primary alternative to GPTQ, achieving strong accuracy by protecting a small fraction of activation‑sensitive weights. AWQ often matches FP16 performance on some tasks and is widely supported by fast 4‑bit inference kernels (e.g., AWQ/Marlin), which in many stacks can be faster than pipelines targeting GPTQ‑formatted weights.
LLM.int8() focuses on 8-bit quantization with near-zero degradation through mixed precision, keeping outlier features in FP16. While limited to 2x compression versus GPTQ’s 4x, it provides the most reliable accuracy preservation.
GGUF/llama.cpp targets CPU inference with excellent cross-platform support, using mixed bit-width “k-quant” formats ideal for edge deployment on Apple Silicon and consumer hardware.
For practitioners, AWQ and GPTQ represent the current sweet spot for 4-bit GPU inference, offering 96-99% accuracy recovery with 3.5-4x compression. The choice between methods depends on specific accuracy requirements, inference speed priorities, and deployment constraints.
Conclusion
Quantization has transformed from academic curiosity to production necessity, enabling LLM deployment from datacenter to edge. The progression from “can we quantize below 8 bits?” to practical 4-bit deployment reflects both algorithmic breakthroughs and growing infrastructure demands.
GPTQ’s core innovation of using second-order Hessian information to guide quantization and compensate for errors proved that 3-4 bit compression is viable with minimal accuracy loss. By minimizing layer output error rather than weight error, GPTQ enables 70B models, which previously required expensive multi-GPU clusters, to run on consumer GPUs.
The field continues evolving rapidly. AWQ has emerged as a strong alternative with superior speed-accuracy tradeoffs, while advanced methods push toward 2-bit quantization. For practitioners today, 4-bit quantization with GPTQ or AWQ represents the sweet spot: 96-99% accuracy recovery with 3.5-4x memory reduction, making frontier models accessible on modest hardware.
The future of machine learning is quantized. These techniques have fundamentally democratized access to state-of-the-art models, transforming deployment from a privilege of well-funded organizations to a capability available to individual researchers and developers worldwide.
References
Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. In International Conference on Learning Representations (ICLR).
Lin, J., Tang, J., Tang, H., et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. In Proceedings of Machine Learning and Systems (MLSys).
Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv preprint arXiv:2208.07339.
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS).
Jacob, B., Kligys, S., Chen, B., et al. (2018). Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Krishnamoorthi, R. (2018). Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342.
Dettmers, T., Svirschevski, R., Egiazarian, V., et al. (2024). SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. In International Conference on Learning Representations (ICLR).
Wang, S., & Kanwar, P. (2019, August 23). BFloat16: The secret to high performance on Cloud TPUs. Google Cloud Blog.
Institute of Electrical and Electronics Engineers. (2019). IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019.
NVIDIA. (2020). NVIDIA A100 Tensor Core GPU Architecture. Whitepaper.
Gerganov, G., et al. (2023). ggml-org/llama.cpp: LLM inference in C/C++. GitHub repository.