1. The Problem: LLMs Are Too Big for Your Phone
Imagine you want to run ChatGPT-level AI on your phone, completely offline — no internet, no cloud server, total privacy. Sounds great, right? The problem is that modern large language models (LLMs) are enormous.
GPT-3, released in 2020, has 175 billion parameters. Each parameter is typically stored as a 16-bit floating-point number (FP16), meaning it occupies 2 bytes of memory. Total memory for GPT-3: 350 GB. The best consumer GPUs have 24 GB of memory. Your phone has maybe 8-12 GB total RAM. Clearly, something has to give.
Even for "smaller" models like LLaMA-2-7B (7 billion parameters), you still need about 14 GB in FP16, which barely fits on a high-end GPU. For a mobile phone, this is completely out of reach.
Model quantization is the most promising solution. Instead of storing each weight as a 16-bit number, what if we stored it as a 4-bit integer? That's a 4× compression. LLaMA-2-7B would shrink from 14 GB to about 3.5 GB — suddenly plausible even for mobile devices.
But here's the catch: naively converting 16-bit weights to 4-bit integers (called "round-to-nearest" or RTN quantization) causes dramatic accuracy loss. The model starts generating nonsense. This paper, AWQ (Activation-aware Weight Quantization), solves this problem elegantly — and it was awarded Best Paper at MLSys 2024.
2. Prerequisites: What You Need to Know First
This section explains every concept from scratch. If you're a complete beginner, start here. If you know machine learning basics, you can skim to Section 3.
2.1 Neural Network Weights and Matrix Multiplication
A large language model like LLaMA or GPT-2 is, at its mathematical core, a very long sequence of matrix multiplications. The most fundamental building block is:

$$\mathbf{y} = \mathbf{W}\mathbf{x}$$
Here:
- $\mathbf{x}$ is the input vector — a list of numbers representing what the model is currently processing (e.g., the meaning of a word encoded as 4096 numbers)
- $\mathbf{W}$ is the weight matrix — a large grid of numbers that the model learned during training. A typical weight matrix in LLaMA-2-7B might be $4096 \times 4096 \approx 16.8$ million numbers.
- $\mathbf{y}$ is the output vector — the result of the computation
The model has thousands of these weight matrices. Their collective values encode the model's "knowledge" about language, facts, and reasoning. The total number of values across all weight matrices is the model's parameter count — 7 billion for LLaMA-2-7B.
Why this matters for quantization: Everything we do to compress the model amounts to replacing the original $\mathbf{W}$ (stored in FP16) with a compressed version $\hat{\mathbf{W}}$ (stored in INT4) while keeping $\hat{\mathbf{y}} = \hat{\mathbf{W}}\mathbf{x}$ as close to $\mathbf{y}$ as possible.
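To make the compression contract concrete, here is a toy NumPy sketch (illustrative sizes and values, not real model weights): replace $\mathbf{W}$ with a crudely quantized $\hat{\mathbf{W}}$ and measure how far the output drifts.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)   # toy "weight matrix"
x = rng.standard_normal(16).astype(np.float32)        # toy input vector

# Crude 4-bit round-to-nearest stand-in for the compressed W-hat
delta = np.abs(W).max() / 2**3
W_hat = np.clip(np.round(W / delta), -8, 7) * delta

y, y_hat = W @ x, W_hat @ x
rel_err = np.linalg.norm(y - y_hat) / np.linalg.norm(y)
print(f"relative output error: {rel_err:.4f}")        # small but nonzero
```

The entire quantization literature is about making that relative error as small as possible while shrinking the bits per weight.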
2.2 Floating-Point Numbers: FP32, FP16, and Why They Matter
Computers represent real numbers approximately using floating-point format. Think of scientific notation: $3.14 \times 10^2$. Floating-point numbers have:
- A sign (positive or negative)
- A mantissa (the significant digits, like 3.14)
- An exponent (the power of 2 or 10, like 2 above)
Different precisions allocate different numbers of bits:
| Format | Total Bits | Range | Precision |
|---|---|---|---|
| FP32 | 32 | ±3.4×10³⁸ | ~7 decimal digits |
| FP16 | 16 | ±65504 | ~3 decimal digits |
| BF16 | 16 | ±3.4×10³⁸ | ~2 decimal digits |
| INT8 | 8 | -128 to 127 | exact integers only |
| INT4 | 4 | -8 to 7 | exact integers only |
Modern LLMs are typically trained in FP32 but deployed in FP16 (a 2× memory reduction with minimal accuracy loss). Quantization further reduces to INT8 or INT4, giving 4-8× compression relative to FP16.
The key challenge: FP16 can represent about 65,000 distinct values in its range, while INT4 can represent only 16 distinct values ($2^4 = 16$). Squeezing the rich information of a 16-bit weight into 4 bits inevitably loses precision — the question is how much.
2.3 What Is Quantization?
Quantization is the process of mapping continuous (or high-precision) values to a discrete set of lower-precision values.
The standard formula for uniform quantization is:

$$Q(w) = \Delta \cdot \mathrm{Round}\!\left(\frac{w}{\Delta}\right), \qquad \Delta = \frac{\max(|\mathbf{w}|)}{2^{N-1}}$$
Where:
- $w$ is the original weight value (in FP16)
- $N$ is the number of quantization bits (e.g., $N = 4$ for INT4)
- $\Delta$ is the quantization scale — the step size between quantization levels
- $\mathrm{Round}(\cdot)$ rounds to the nearest integer
- $Q(w)$ is the quantized approximation
Let's trace through a concrete example. Suppose $N = 4$ (4-bit), so we have $2^4 = 16$ possible integer values ($-8$ to $7$). If the maximum absolute weight value is $1.0$, then:

$$\Delta = \frac{1.0}{2^{4-1}} = 0.125$$

A weight $w = 0.33$ would be quantized to:

$$Q(0.33) = 0.125 \cdot \mathrm{Round}(0.33 / 0.125) = 0.125 \cdot 3 = 0.375$$

The quantization error for this weight is $|0.375 - 0.33| = 0.045$ — small, but accumulated across billions of weights, these errors compound into significant model degradation.
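The uniform quantizer can be checked in a few lines. This sketch uses illustrative values ($N = 4$, max weight $1.0$, $w = 0.33$); only the formula itself comes from the text above.

```python
import numpy as np

def quantize(w, w_max, bits=4):
    """Uniform symmetric quantization: Q(w) = Delta * Round(w / Delta)."""
    delta = w_max / 2 ** (bits - 1)          # step size Delta
    q = np.clip(np.round(w / delta), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q * delta

w = 0.33
w_hat = quantize(w, w_max=1.0, bits=4)       # Delta = 0.125
print(w_hat, abs(w_hat - w))                 # 0.375  0.045
```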
2.4 The Quantization Error Problem
The challenge of low-bit quantization is best understood through perplexity comparison. For the OPT-6.7B model (a popular open-source LLM):
| Method | Bits | WikiText Perplexity |
|---|---|---|
| FP16 (full precision) | 16 | 10.86 |
| RTN (round-to-nearest) | INT3, g128 | 23.54 |
| AWQ | INT3, g128 | 11.39 |
RTN at 3-bit quantization more than doubles the perplexity — meaning the model generates far less coherent text. AWQ brings it back almost to FP16 quality. How?
The key insight is that rounding errors are not equally harmful. Some weights, when corrupted by quantization, cause catastrophic degradation. Others can be quantized aggressively with minimal effect. The art is in identifying and protecting the critical ones.
2.5 Post-Training Quantization vs. Quantization-Aware Training
There are two main families of quantization approaches:
Quantization-Aware Training (QAT): During training, simulate low-bit arithmetic. The model learns to be robust to quantization errors. Works well but requires the full training process — for LLaMA-2 with 7B parameters trained on 2 trillion tokens, this means weeks on thousands of GPUs. Completely impractical for post-deployment compression.
Post-Training Quantization (PTQ): Take an already-trained model and compress it using a small calibration dataset (typically 128-512 text samples). No retraining. Much cheaper, but you have less ability to correct quantization errors.
AWQ is a post-training quantization method. It uses only a small calibration dataset to measure statistics (specifically, the average magnitude of activations per channel) — no gradient computation, no backpropagation. This makes it extremely fast and generalizable.
2.6 Weight-Only vs. Weight+Activation Quantization
Another important distinction:
W8A8 (Weight 8-bit, Activation 8-bit): Both weights and activations are quantized to INT8. Computations happen in INT8. Hardware can exploit fast INT8 matrix multiplication units. Example method: SmoothQuant.
W4A16 (Weight 4-bit, Activation 16-bit): Only weights are stored as INT4; activations remain FP16. At inference time, weights are dequantized back to FP16 on the fly before the matrix multiplication.
Why would W4A16 make sense? Look at the arithmetic:
- Weight memory: 7B parameters × 4 bits = 3.5 GB (vs. 14 GB for FP16)
- Activation memory: Much smaller (proportional to batch size × sequence length × hidden dim)
For on-device inference with batch size 1 (single user), weight loading from memory dominates. The generation stage is memory-bandwidth-bound — the bottleneck is how fast you can load weights from DRAM, not how fast you can compute. By quantizing weights to 4 bits, you reduce memory traffic by 4×, leading to nearly 4× speedup.
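A back-of-the-envelope roofline makes the memory-bound argument concrete. The bandwidth figure below is hypothetical (roughly 1 TB/s, in the ballpark of a high-end GPU); the point is the ratio, not the absolute numbers.

```python
# At batch size 1, every weight is streamed from DRAM once per generated
# token, so tokens/sec is capped at bandwidth / weight_bytes.
def max_tokens_per_sec(params_billion, bits, bandwidth_GBps):
    weight_gb = params_billion * bits / 8   # GB of weights read per token
    return bandwidth_GBps / weight_gb

bw = 1000                                   # hypothetical ~1 TB/s DRAM bandwidth
fp16 = max_tokens_per_sec(7, 16, bw)        # 7B model in FP16
int4 = max_tokens_per_sec(7, 4, bw)         # same model in INT4
print(fp16, int4, int4 / fp16)              # INT4 ceiling is exactly 4x higher
```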
AWQ focuses on W4A16 quantization — exactly the right choice for on-device LLM deployment.
2.7 Perplexity: How We Measure Language Model Quality
Perplexity is the standard metric for evaluating language model quality on text generation tasks. Intuitively, it measures how "surprised" or "uncertain" the model is when it sees real text.
- Lower perplexity = better model (the model is less surprised, more confident)
- A perfect model that always predicts the next word correctly would have perplexity = 1
- A model that randomly guesses among 50,000 words would have perplexity ≈ 50,000
- Good LLMs typically have perplexity 5-15 on WikiText-2
Think of perplexity as an inverse measure of fluency. A model with perplexity 11 reads much more naturally than one with perplexity 23. In the AWQ experiments, the goal is to achieve perplexity as close as possible to the FP16 baseline (e.g., 10.86 for OPT-6.7B) even after aggressive 3-bit or 4-bit quantization.
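Perplexity is simply the exponential of the average per-token negative log-likelihood, which makes the two extreme cases above easy to verify:

```python
import math

def perplexity(token_probs):
    """token_probs: probability the model assigned to each actual next token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

print(perplexity([1.0, 1.0, 1.0]))      # perfect model -> 1.0
print(perplexity([1 / 50000] * 4))      # uniform guess over 50k words -> 50000.0
```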
2.8 Group Quantization and Per-Channel Scaling
Group quantization divides each weight row into small groups (e.g., 128 weights per group) and quantizes each group independently with its own scale $\Delta$. This is standard in modern LLM quantization:
- Notation: "INT3-g128" means 3-bit quantization with group size 128
- Smaller groups → more scale parameters stored → better accuracy but slightly higher overhead
- Group size 128 is the standard sweet spot
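Group quantization fits in about ten lines. This NumPy sketch shows the INT3-g128 layout from the notation above: a 4096-wide row stores 32 scales, one per group (row contents are random, for illustration).

```python
import numpy as np

def group_quantize(w_row, bits=3, group_size=128):
    """Quantize one weight row group-by-group; return dequantized row + scales."""
    groups = w_row.reshape(-1, group_size)
    delta = np.abs(groups).max(axis=1, keepdims=True) / 2 ** (bits - 1)
    delta = np.maximum(delta, 1e-8)                  # avoid divide-by-zero
    q = np.clip(np.round(groups / delta), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return (q * delta).reshape(-1), delta.ravel()

row = np.random.default_rng(0).standard_normal(4096)
deq, scales = group_quantize(row)                    # INT3-g128
print(scales.shape)                                  # (32,): one Delta per group
```

Smaller groups mean more rows in `scales` — exactly the accuracy-vs-overhead trade-off described above.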
Per-channel scaling is a related concept: instead of using a single scale $\Delta$ for an entire weight matrix, use one scale per output channel (one per row of $\mathbf{W}$). AWQ goes further by applying a per-input-channel scaling factor that is embedded into the weights before quantization.
3. What AWQ Does: The Big Picture
AWQ makes one beautiful observation and builds an entire system around it:
Not all weights in an LLM are equally important. Protecting just 1% of them — the "salient" ones — dramatically reduces quantization error. And crucially, we can identify these salient weights not by looking at the weights themselves, but by looking at the activations that flow through them.
The naive approach to protecting important weights: keep them in FP16 while quantizing the rest to INT4. This is "mixed precision" quantization. It works (perplexity drops from 23.54 to 13.00 for OPT-6.7B at INT3), but has a major problem: hardware hates mixed precision. Real accelerators (GPUs, NPUs) are optimized for uniform data types. A weight matrix with 99% INT4 values and 1% FP16 values is awkward to work with and doesn't yield practical speedups.
AWQ's solution: instead of keeping salient weights in FP16, scale them up before quantization and scale the corresponding activations down. This is a mathematical equivalence transformation that:
- Reduces the quantization error for salient weights
- Keeps everything in INT4 (no mixed precision)
- Incurs zero extra computation cost at inference time
The scaling factor is determined by a simple grid search over activation statistics collected from a small calibration dataset — no backpropagation required.
4. The Core Insight: Not All Weights Are Equal
This section explains the crucial observation that makes AWQ work.
The experiment
The authors quantized OPT models using INT3-g128 and asked: what happens if we keep a tiny fraction of weight channels in FP16 instead of quantizing them?
| Method | FP16 kept | WikiText PPL (OPT-6.7B) |
|---|---|---|
| RTN (all quantized) | 0% | 23.54 |
| Keep top 1% by weight magnitude | 1% | 22.37 |
| Keep top 1% by activation magnitude | 1% | 11.39 |
| Keep top 1% randomly | 1% | 23.54 |
Three findings jump out:
- Keeping 1% of weights in FP16 based on activation magnitude reduces perplexity from 23.54 to 11.39 — nearly recovering FP16 quality (10.86).
- Keeping 1% based on weight magnitude (the naive choice) barely helps: 22.37 vs. 23.54.
- Random selection does nothing.
Why does activation magnitude identify important weights?
A weight $w_{ij}$ in matrix $\mathbf{W}$ connects input channel $j$ to output channel $i$. The "importance" of this weight to the final model output depends not just on $w_{ij}$ itself, but on how large the corresponding input $x_j$ typically is. If $x_j$ is typically large, then even a small error in $w_{ij}$ gets amplified: the error in the output is $\mathrm{Err}(w_{ij}) \cdot x_j$, which grows with $|x_j|$.

So we should look at input channel $j$'s activation magnitude to judge whether the weights in that channel are important. Weight channels corresponding to large activations are more "salient" — their quantization errors have larger effect on the output.
This is the activation-awareness of AWQ: using input activation statistics to guide weight quantization decisions.
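The saliency criterion is literally a per-channel mean of $|x|$ over calibration activations. A minimal sketch (synthetic data; `salient_channels` is a hypothetical helper name, not from the official codebase):

```python
import numpy as np

def salient_channels(X, frac=0.01):
    """X: (num_tokens, in_features) calibration activations.
    Returns indices of the top `frac` input channels by mean |activation|."""
    mag = np.abs(X).mean(axis=0)              # per-input-channel statistic
    k = max(1, int(frac * X.shape[1]))
    return np.argsort(mag)[-k:]               # largest-magnitude channels

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 400))
X[:, 7] *= 50                                 # plant one obvious outlier channel
print(salient_channels(X, frac=0.01))         # top 1% includes channel 7
```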
5. AWQ Method Details
5.1 Analyzing Quantization Error Mathematically
AWQ starts with a rigorous analysis. For a weight-only quantization setting, the linear layer computes:

$$\mathbf{y} = \mathbf{W}\mathbf{x}$$

After quantization, the weight becomes $Q(w)$ (where $w$ is a single weight value from $\mathbf{W}$), and the output becomes $Q(\mathbf{W})\mathbf{x}$.

Recall the quantization function:

$$Q(w) = \Delta \cdot \mathrm{Round}\!\left(\frac{w}{\Delta}\right), \qquad \Delta = \frac{\max(|\mathbf{w}|)}{2^{N-1}}$$

The quantization error for a single element is:

$$\mathrm{Err}\big(Q(w)\,x\big) = |Q(w)\,x - w\,x| = \Delta \cdot \left|\mathrm{RoundErr}\!\left(\frac{w}{\Delta}\right)\right| \cdot |x| \tag{3}$$

where $\mathrm{RoundErr}(\cdot)$ is the rounding residual, roughly uniformly distributed in $[-0.5, 0.5]$, giving an expected absolute value of $0.25$.

Note two things:
- The error is proportional to $|x|$ — larger activations amplify quantization error
- The error is proportional to $\Delta$ — larger scale factors amplify error
5.2 The Activation-Aware Scaling Trick
Now suppose we multiply a salient weight $w$ by a scaling factor $s > 1$ before quantization and divide the activation by $s$ to compensate:

$$Q(w \cdot s) \cdot \frac{x}{s}$$

This is mathematically equivalent to $w \cdot x$ in the ideal case (without quantization error), but behaves differently under quantization. The new group scale satisfies:

$$\Delta' \approx \Delta$$

(approximately, assuming multiplying by $s$ doesn't change which element of the group is the maximum — valid when $s$ is not too large).

The quantization error with scaling is:

$$\mathrm{Err}\left(Q(w \cdot s)\,\frac{x}{s}\right) = \Delta' \cdot \left|\mathrm{RoundErr}\!\left(\frac{w s}{\Delta'}\right)\right| \cdot |x| \cdot \frac{1}{s}$$

Compare with Equation (3). The ratio of new error to old error is:

$$\frac{\mathrm{Err}'}{\mathrm{Err}} = \frac{\Delta'}{\Delta} \cdot \frac{1}{s}$$

The rounding term needs a moment of care: $\mathrm{RoundErr}(ws/\Delta')$ is the rounding residual of the new scaled value, but it has the same statistics as before — still roughly uniform in $[-0.5, 0.5]$ with expected magnitude $0.25$ — so on average it drops out of the ratio.

What remains is the key effect. Since $\Delta' \approx \Delta$, the error ratio is approximately $1/s < 1$: for the salient channels (where $|x|$ is large), scaling $w$ up by $s$ shrinks the expected quantization error by roughly a factor of $s$. Specifically:

- The grid step $\Delta'$ stays (almost) unchanged, so the scaled-up weight $ws$ is quantized at relatively finer resolution
- The compensating factor $\frac{1}{s}$ on the activation side reduces the final error for large $|x|$ values

Empirically, the authors find that for $s = 2$, the perplexity of OPT-6.7B under INT3 drops from 23.54 (no scaling) to 11.92 — nearly matching the 11.39 of mixed-precision FP16 (Table 2 in the paper).
Why doesn't scaling harm non-salient channels?
If $s$ is too large, it can increase the scale $\Delta$ of the quantized group, making quantization coarser for all weights in the group, including non-salient ones. The authors observe that for $s = 4$, 21.2% of weights end up in groups whose scale $\Delta$ has changed (so their rounding lands on a different integer), which increases error for non-salient channels. The optimal $s$ balances protecting salient channels without harming others.
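The $1/s$ error reduction can be checked numerically. This sketch uses synthetic values and deliberately holds the group scale $\Delta$ fixed (as if the group maximum sits in other, non-salient weights), then compares the average output error for a "salient" weight with and without pre-scaling.

```python
import numpy as np

def q(w, delta, bits=3):
    """Round-to-nearest quantize/dequantize at step `delta`."""
    lvl = 2 ** (bits - 1)
    return np.clip(np.round(w / delta), -lvl, lvl - 1) * delta

rng = np.random.default_rng(1)
W = rng.uniform(-0.4, 0.4, 10000)   # "salient" weights inside one group
delta = 1.0 / 4                     # INT3 step; group max 1.0 held by others
x, s = 30.0, 2.0                    # large activation; AWQ-style scale

err_plain  = np.abs(q(W, delta) * x - W * x).mean()
err_scaled = np.abs(q(W * s, delta) * (x / s) - W * x).mean()  # delta unchanged
print(err_plain / err_scaled)       # ~s: error shrinks by roughly 1/s
```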
5.3 Searching for the Optimal Scale
Rather than hand-tuning a scale for each channel, AWQ defines an automatic search procedure.
The objective: Find per-channel scales $\mathbf{s}$ that minimize the output reconstruction error after quantization:

$$\mathbf{s}^{*} = \arg\min_{\mathbf{s}} \left\| Q\big(\mathbf{W} \cdot \mathrm{diag}(\mathbf{s})\big)\big(\mathrm{diag}(\mathbf{s})^{-1} \cdot \mathbf{X}\big) - \mathbf{W}\mathbf{X} \right\|$$
Here:
- $\mathbf{W}$ is the original weight matrix in FP16
- $Q(\cdot)$ is the quantization function (INT3 or INT4)
- $\mathbf{X}$ is a small set of input activations collected from the calibration dataset
- $\mathrm{diag}(\mathbf{s})$ is a diagonal matrix of per-channel scales
- The scales multiply the weights (columns of $\mathbf{W}$) and inversely divide the activations (rows of $\mathbf{X}$)
Since $Q$ is not differentiable (it involves rounding), gradient-based optimization is unreliable. Instead, AWQ uses a simple grid search over a compact 1D search space.
The search space: Since the salient channels are identified by large activation magnitude, AWQ uses the average activation magnitude per input channel, $\mathbf{s}_{\mathbf{X}}$, as a reference, and searches for the best power $\alpha$:

$$\mathbf{s} = \mathbf{s}_{\mathbf{X}}^{\alpha}, \qquad \alpha^{*} = \arg\min_{\alpha \in [0,1]} \mathcal{L}\big(\mathbf{s}_{\mathbf{X}}^{\alpha}\big)$$

- $\alpha = 0$: no scaling (all channels get scale 1) — equivalent to RTN
- $\alpha = 1$: maximum scaling proportional to activation magnitude — most aggressive protection of salient channels
- Optimal $\alpha$ typically falls around 0.5, balancing salient and non-salient channel protection
The grid search over $\alpha \in [0, 1]$ with step size 0.05 (20 evaluations) is extremely fast — it requires only forward passes through the quantized layer, no backpropagation. This is why AWQ runs without any GPU training infrastructure.
Additional trick: weight clipping
The authors also apply weight clipping to minimize MSE quantization error, but this is a secondary optimization on top of the main scaling approach.
How the scaling is absorbed into inference
At inference time, the factor $1/s$ that should multiply the activations is instead fused into the preceding layer. For example, if the preceding layer is a normalization or another linear operator, the division by $s$ can be absorbed into its weights. This means inference runs with zero overhead from the scaling operation — the quantized weights are simply stored with the scale baked in.
This equivalence transformation is the key to AWQ being hardware-friendly: everything remains in INT4, no mixed precision, no extra operations.
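The absorption trick can be verified numerically: folding $1/s$ into a preceding LayerNorm's affine weight leaves the (pre-quantization) output bit-for-bit equivalent. A toy sketch with made-up sizes:

```python
import numpy as np

def layernorm(x, gamma):
    """Minimal LayerNorm with a learnable scale gamma (no bias, eps fixed)."""
    mu, var = x.mean(), x.var()
    return gamma * (x - mu) / np.sqrt(var + 1e-5)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
gamma = np.ones(64)
W = rng.standard_normal((16, 64))
s = rng.uniform(1.0, 4.0, 64)            # per-input-channel AWQ scales

# Original path: y = W @ LN(x)
y_ref = W @ layernorm(x, gamma)

# AWQ path: 1/s folded into gamma, weights pre-multiplied by s
y_awq = (W * s) @ layernorm(x, gamma / s)
print(np.max(np.abs(y_ref - y_awq)))     # ~0: exact mathematical equivalence
```

In the real system the quantizer then acts on `W * s`, which is where the accuracy benefit comes from; the fusion itself is free.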
6. TinyChat: Deploying AWQ on Real Devices
Having a compression algorithm is only half the story. The other half is actually running fast inference on real hardware. The AWQ paper introduces TinyChat, a specialized inference system built to realize the theoretical speed gains from W4A16 quantization.
Why Converting Memory to Speed Is Non-Trivial
W4A16 quantization saves memory (4× reduction in weight footprint) but doesn't automatically translate to 4× speed. The challenge: modern hardware doesn't natively multiply INT4 weights by FP16 activations. You need to:
- Load INT4 weights from memory (fast — only 4 bits per weight)
- Dequantize them to FP16 on-the-fly
- Multiply FP16 weights by FP16 activations
The dequantization step is costly if done naively. TinyChat addresses this with two key techniques:
On-the-Fly Weight Dequantization with Kernel Fusion
Rather than dequantizing weights to DRAM (which would negate memory savings), TinyChat fuses the dequantization directly into the matrix multiplication kernel. The kernel reads INT4 weights from memory, converts them to FP16 registers, and immediately performs the multiply-accumulate — all without touching DRAM for the dequantized weights.
For GPU inference, AWQ kernels are implemented in CUDA/PTX; for CPU/ARM, they use NEON SIMD instructions.
SIMD-Aware Weight Packing (ARM NEON)
On ARM CPUs (phones, Apple Silicon, Raspberry Pi), SIMD (Single Instruction Multiple Data) registers process multiple values simultaneously. ARM NEON has 128-bit registers.
The challenge: 4-bit weights don't align naturally to byte boundaries. TinyChat uses a reordering and packing strategy:
- Each 128-bit NEON register holds 32 4-bit weights
- Weights are reordered offline so that a single 128-bit AND operation followed by a shift can unpack all 32 weights at once
- This requires just 3 SIMD instructions to unpack 32 weights, vs. 3 scalar instructions per weight in naive implementation
- Results in up to 1.2× speedup for the packing/unpacking alone
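The bit-level idea behind the packing can be sketched in Python (this is an illustration of nibble packing with masks and shifts, not the actual NEON register layout or reordering):

```python
import numpy as np

def pack_int4(vals):
    """Pack signed 4-bit ints (-8..7) two per byte, low nibble first."""
    u = (np.asarray(vals) & 0xF).astype(np.uint8)     # two's-complement nibbles
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Recover the signed weights with one AND and one SHIFT per pair."""
    lo = (packed & 0xF).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return np.where(out >= 8, out - 16, out)          # restore the sign

w = np.array([-8, 7, 3, -1], dtype=np.int8)
print(unpack_int4(pack_int4(w)))                      # [-8  7  3 -1]
```

On NEON the same mask/shift pair operates on a whole 128-bit register, which is why 32 weights unpack in a handful of instructions.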
Kernel Fusion
TinyChat fuses multiple operations into single CUDA/NEON kernels:
- Layer normalization: Multiplication, division, and square root fused into one kernel
- Attention layers: QKV projections and positional encoding fused; KV cache pre-allocated and updated inside the attention kernel
- Benefits: Reduces kernel launch overhead and intermediate DRAM accesses
On NVIDIA 4090, each FP16 kernel launches in ~0.01ms. For models like Falcon and StarCoder with many sequential operations, kernel fusion gives substantial speedups.
7. Experiments and Results
7.1 Setup and Baselines
Quantization setting: Weight-only grouped quantization (INT4 and INT3 with group size 128). This is denoted as W4A16 or W3A16 — 4 or 3-bit weights, 16-bit activations.
Models evaluated:
- LLaMA family: 7B, 13B, 30B, 65B
- Llama-2 family: 7B, 13B, 70B
- OPT family: 1.3B, 2.7B, 6.7B, 13B, 30B
- Mistral-7B, Mixtral-8x7B (mixture of experts)
- Instruction-tuned: Vicuna-7B, Vicuna-13B
- Multi-modal: OpenFlamingo-9B, LLaVA-13B, VILA-7B, VILA-13B
Baselines:
- RTN (Round-to-Nearest): Simplest baseline, quantizes each group independently
- GPTQ: State-of-the-art PTQ using second-order (Hessian) information for error correction
- GPTQ-R: GPTQ with reordering trick for better performance
Calibration data: Pile dataset (Gao et al., 2020), grid size of 20 for search.
7.2 Language Model Results
Table 4: Llama-2 and LLaMA perplexity (WikiText-2)
| Model | FP16 | RTN (INT3) | GPTQ (INT3) | AWQ (INT3) |
|---|---|---|---|---|
| Llama-2-7B | 5.47 | 6.66 | 6.43 | 6.24 |
| Llama-2-13B | 4.88 | 5.52 | 5.48 | 5.32 |
| Llama-2-70B | 3.32 | 3.98 | 3.88 | 3.74 |
| LLaMA-7B | 5.68 | 7.01 | 8.81 | 6.35 |
| LLaMA-13B | 5.09 | 5.88 | 5.66 | 5.52 |
| LLaMA-30B | 4.10 | 4.88 | 4.88 | 4.61 |
| LLaMA-65B | 3.53 | 4.24 | 4.17 | 3.95 |
Key observations:
- AWQ consistently beats both RTN and GPTQ at INT3 quantization
- At INT4, the gains are smaller (all methods work reasonably well at 4-bit) but AWQ still achieves the best perplexity
- For LLaMA-7B at INT3, GPTQ dramatically underperforms (8.81 PPL) while AWQ achieves 6.35 — this is because GPTQ overfits to the calibration set, hurting generalization
OPT models show similar trends. AWQ at INT3-g128 achieves:
- OPT-1.3B: 16.32 PPL (vs. FP16: 14.62, RTN: 119.47)
- OPT-6.7B: 11.39 PPL (vs. FP16: 10.86, RTN: 23.54)
- OPT-13B: 10.56 PPL (vs. FP16: 10.13, RTN: 46.04)
7.3 Instruction-Tuned Models (Vicuna)
Instruction-tuned models (fine-tuned to follow human instructions) are harder to quantize because fine-tuning changes the weight distribution in subtle ways. The authors evaluate using GPT-4 as a judge on 80 questions.
Figure 5 results for Vicuna-7B (INT3-g128, GPT-4 evaluation):
- vs. the FP16 counterpart: AWQ wins on 23 questions, ties on 5, loses on 52
- vs. RTN: AWQ wins on 6 and ties on 3 more questions
- vs. GPTQ: AWQ wins 11 more questions
The "wins" represent questions where the quantized model gives better-judged answers than FP16 (which can happen since quantization can act as slight regularization). More importantly, AWQ shows fewer losses compared to RTN and GPTQ.
7.4 Multi-Modal Models (OpenFlamingo, LLaVA, VILA)
This is a first for LLM quantization research — applying W4A16 quantization to vision-language models.
OpenFlamingo-9B on COCO Captioning (CIDEr↑):
| Method | INT4-g128 | INT3-g128 |
|---|---|---|
| FP16 | 81.70 (32-shot) | - |
| RTN | 77.13 | 64.79 |
| GPTQ | 74.98 | 64.77 |
| AWQ | 80.53 | 74.47 |
AWQ reduces the quantization degradation from 4.57 to just 1.17 CIDEr points at INT4 — a 4× model size reduction with negligible performance loss.
VILA-7B and VILA-13B on 11 visual-language benchmarks (Table 7): AWQ consistently shows lossless or near-lossless quantization performance across all 11 benchmarks including VQAv2, GQA, VizWiz, SQA, POPE, MME, MMB, SEED, LLaVA-bench, and MM-Vet.
The success on multi-modal models demonstrates that AWQ's activation-based salient weight detection generalizes across modalities — the same calibration strategy works whether the model processes text or images.
7.5 Coding and Math Benchmarks
CodeLlama-7b-Instruct on MBPP (programming) and GSM8K (math), INT4-g128:
| Model | FP16 | RTN | GPTQ | AWQ |
|---|---|---|---|---|
| MBPP pass@1 | 38.53 | 37.51 | 31.97 | 40.64 |
| MBPP pass@10 | 49.77 | 48.49 | 44.75 | 49.25 |
| Llama-2-7B GSM8K | 13.87 | 11.07 | 12.13 | 13.57 |
| Llama-2-13B GSM8K | 26.16 | 21.23 | 24.26 | 25.12 |
| Llama-2-70B GSM8K | 56.41 | 53.98 | 56.03 | 56.40 |
Remarkably, AWQ on MBPP exceeds FP16 performance (40.64 vs. 38.53 pass@1) — quantization as slight regularization can sometimes improve task-specific performance. For GSM8K, AWQ at INT4 nearly matches FP16 performance across all model scales.
7.6 Speedup Results (TinyChat)
RTX 4090 (desktop GPU) — tokens/sec:
| Model | FP16 (Huggingface) | AWQ + TinyChat |
|---|---|---|
| Llama-2-7B | 62 | 194 (3.1×) |
| Llama-2-13B | 52 | 110 (2.1×) |
| MPT-7B | 85 | 158 (1.9×) |
| MPT-30B | OOM | 49 |
| Falcon-7B | 53 | 53 (1.0×) |
Jetson Orin (mobile GPU) — tokens/sec:
| Model | FP16 | AWQ + TinyChat |
|---|---|---|
| Llama-2-7B | 12 | 39 (3.3×) |
| Llama-2-13B | 9 | 25 (2.8×) |
| MPT-30B | OOM | 12 |
| Falcon-7B | 12 | 38 (3.2×) |
Key results:
- Speedups of roughly 2-3.3× over FP16 Huggingface on both desktop and mobile GPUs
- Enables models that couldn't fit in memory (MPT-30B = OOM in FP16) to run at interactive speeds
- Llama-2-70B: deployable on a single NVIDIA Jetson Orin with 64GB memory
VILA visual-language models on multiple platforms (Table 10):
- VILA-7B: 155.3 tokens/sec (4090) vs. 81.6 (FP16) — 1.9× speedup
- VILA-13B: 102.1 tokens/sec (4090) — runs where FP16 is OOM
8. Limitations and Open Questions
AWQ is an excellent method, but it has several limitations worth understanding:
8.1 Still Requires Calibration Data
AWQ needs a small calibration dataset to measure average activation magnitudes per channel. While the dataset is small (16 sequences of 2048 tokens, 10× less than GPTQ's 192 sequences), it still:
- Requires access to representative data for the target domain
- Introduces a mild dependence on calibration distribution
The authors show AWQ is more robust to calibration distribution mismatch than GPTQ (Figure 8b — PPL increase of only 0.5-0.6 vs. GPTQ's 2.3-4.9 when calibration and evaluation domains don't match). But this robustness is a matter of degree, not elimination.
8.2 Simplified Search Space
The scaling search uses a single parameter $\alpha$ per layer, with $\alpha \in [0, 1]$. This constrains the search to a 1D curve in the space of all possible per-channel scales. A more expressive search might find better scales, at the cost of more computation.
8.3 Extreme Low-Bit Quantization (INT2)
At very aggressive 2-bit quantization (INT2), AWQ degrades significantly and GPTQ's reconstruction-based approach helps. The authors show that combining AWQ + GPTQ (Table 9) achieves the best INT2 results — AWQ alone is insufficient here.
8.4 Generative Tasks vs. Discriminative Tasks
AWQ is evaluated primarily on generative tasks (text generation, captioning). For classification or embedding tasks where the model architecture and use pattern differ, the activation magnitude heuristic may be less valid.
8.5 Quantization of Certain Architectures
Some architectures (e.g., attention layers with specific normalization strategies) may have different activation distribution patterns. The authors tested Mistral and Mixtral (Mixture-of-Experts) and found AWQ generalizes well, but the scaling heuristic could theoretically fail for exotic architectures.
8.6 No Formal Convergence Guarantee
The search in Equation (5) is a heuristic grid search. There is no theoretical guarantee that the $\alpha$ found by the search is globally optimal. The 1D grid search is practical and effective empirically, but a more rigorous optimization (with accompanying theory) would strengthen the method.
9. Reproducibility and Code
AWQ is among the most practically impactful quantization methods and has excellent open-source support.
Official Implementation: https://github.com/mit-han-lab/llm-awq
TinyChat: Available within the same repository, supporting CUDA, ARM NEON, and Apple Metal backends.
Ecosystem Adoption: AWQ has been integrated into virtually every major LLM serving framework:
- HuggingFace Transformers (`AutoAWQForCausalLM`)
- NVIDIA TensorRT-LLM
- Microsoft DirectML
- Google Vertex AI
- Intel Neural Compressor
- Amazon SageMaker
- vLLM, LMDeploy, FastChat
Reproducing the key experiment (Table 3 from the paper — OPT-6.7B INT3-g128):
```
pip install awq
```

(The remaining evaluation steps follow the official llm-awq repository's README.)
Expected result: WikiText-2 PPL ≈ 11.39 (vs. RTN's 23.54, FP16's 10.86)
Minimal implementation of the core AWQ scaling search (pseudocode):
```python
import numpy as np

def awq_quantize_layer(W, X, bits=4, group_size=128, grid_size=20):
    """Grid-search s = mean(|X|)**alpha over alpha in [0, 1]; return the
    quantized weights and the alpha with minimal reconstruction error."""
    def quantize(w):
        # group-wise symmetric round-to-nearest
        out_f, in_f = w.shape
        wg = w.reshape(out_f, in_f // group_size, group_size)
        delta = np.abs(wg).max(axis=-1, keepdims=True) / 2 ** (bits - 1)
        delta = np.maximum(delta, 1e-8)
        q = np.clip(np.round(wg / delta), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
        return (q * delta).reshape(out_f, in_f)

    x_mag = np.abs(X).mean(axis=0)        # avg activation magnitude per channel
    y_ref = X @ W.T                       # FP16 reference output
    best_err, best_alpha, best_W = np.inf, 0.0, None
    for i in range(grid_size + 1):
        alpha = i / grid_size
        s = np.maximum(x_mag, 1e-4) ** alpha
        W_q = quantize(W * s) / s         # scale up, quantize, fold scale back
        err = np.mean((X @ W_q.T - y_ref) ** 2)
        if err < best_err:
            best_err, best_alpha, best_W = err, alpha, W_q
    return best_W, best_alpha
```
10. Connection to the Bigger Picture: Compression Landscape
AWQ sits in the broader landscape of LLM compression, alongside methods that target different axes:
10.1 Relation to GPTQ
Both AWQ and GPTQ are post-training quantization methods targeting the same problem. Their key differences:
| Aspect | AWQ | GPTQ |
|---|---|---|
| Core principle | Activation-aware scaling | Hessian-based error correction |
| Calibration data | 16 sequences | 192 sequences |
| Backpropagation | No | No (but uses Hessian) |
| Overfitting risk | Low | Higher (Table 4, LLaMA-7B INT3) |
| Orthogonality | Yes — can combine | Yes — can combine |
The authors show in Table 9 that AWQ + GPTQ combined outperforms either alone at extreme low-bit (INT2) quantization. They are complementary, not competing.
10.2 Relation to SVD-Based Compression
AWQ is a weight-space compression method. An alternative approach is low-rank decomposition via SVD (Singular Value Decomposition):
Given weight matrix , compute:
where captures the dominant singular values. Methods like AdaLoRA and ASVD (activation-aware SVD) use this principle.
The connection to AWQ: ASVD uses activation statistics similarly to AWQ — scaling rows and columns of by activation magnitudes before performing SVD, so the decomposition focuses compression budget on less-activated directions. This is the same activation-awareness principle, applied to a different compression primitive (rank reduction vs. quantization).
Key difference: Quantization preserves the exact matrix structure but reduces precision; SVD changes the matrix structure (it becomes a product of two smaller matrices) but preserves precision. Quantization generally achieves better compression ratios for the same accuracy on large models.
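For contrast with quantization, a minimal rank-$k$ SVD compression of a weight matrix looks like this (NumPy sketch, random matrix for illustration):

```python
import numpy as np

def svd_compress(W, k):
    """Rank-k approximation: W ~ A @ B with A = U_k * S_k, B = V_k^T."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]                 # (out, k)
    B = Vt[:k, :]                        # (k, in)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
A, B = svd_compress(W, k=64)
params_ratio = (A.size + B.size) / W.size        # 0.5: half the parameters
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(params_ratio, err)
```

Note how the matrix becomes a product of two thin matrices (structure changes, precision stays) — the mirror image of quantization.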
10.3 Relation to Knowledge Distillation and Pruning
The compression methods for LLMs form a landscape:
- Quantization (AWQ, GPTQ, BitNet): Reduce precision of weights
- Pruning (SparseGPT, Wanda): Remove weights entirely (set to zero)
- Low-rank decomposition (SVD, LoRA): Replace full matrices with low-rank products
- Knowledge distillation: Train a smaller model to mimic a larger one
AWQ focuses on quantization and is orthogonal to pruning and distillation — in principle, you could combine all three.
11. Conclusion
AWQ (Activation-aware Weight Quantization) is a landmark contribution to on-device LLM deployment. Let's recap its key contributions:
The intellectual insight: Weight importance in LLMs is determined by the activations flowing through them, not by the weights themselves. A 1% minority of "salient" weight channels — those corresponding to large-magnitude activations — are responsible for the majority of quantization-induced accuracy degradation.
The technical solution: Rather than using hardware-unfriendly mixed precision, AWQ applies a mathematically equivalent per-channel scaling that protects salient weights within the uniform INT4 format. The scaling factor is found by a fast, training-free grid search over activation statistics collected from a small calibration set.
The system contribution: TinyChat translates the theoretical 4× memory savings into 3-4× real-world inference speedup on both desktop GPUs and mobile devices, through careful kernel fusion, SIMD-aware weight packing, and platform-specific optimization.
The impact: AWQ is now the de facto standard for 4-bit LLM quantization in production systems. It enables the Llama-2-70B model — which previously required 8 high-end GPUs — to run on a single mobile GPU (NVIDIA Jetson Orin with 64GB), or the 13B model to run interactively on a laptop GPU (RTX 4070 with 8GB).
The paper asks a simple question — "which weights matter most?" — and finds a simple answer: look at the activations. This is the essence of activation-awareness, and it's a principle that will likely continue to guide compression research for years to come.
Reviewed by Zhongzhu Zhou on 2026-04-03. Paper: AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration arXiv: 2306.00978 | MLSys 2024 Best Paper Award