
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration — In-Depth Technical Review

1. The Problem: LLMs Are Too Big for Your Phone

Imagine you want to run ChatGPT-level AI on your phone, completely offline — no internet, no cloud server, total privacy. Sounds great, right? The problem is that modern large language models (LLMs) are enormous.

GPT-3, released in 2020, has 175 billion parameters. Each parameter is typically stored as a 16-bit floating-point number (FP16), meaning it occupies 2 bytes of memory. Total memory for GPT-3: 350 GB. The best consumer GPUs have 24 GB of memory. Your phone has maybe 8-12 GB total RAM. Clearly, something has to give.

Even for "smaller" models like LLaMA-2-7B (7 billion parameters), you still need about 14 GB in FP16, which barely fits on a high-end GPU. For a mobile phone, this is completely out of reach.

Model quantization is the most promising solution. Instead of storing each weight as a 16-bit number, what if we stored it as a 4-bit integer? That's a 4× compression. LLaMA-2-7B would shrink from 14 GB to about 3.5 GB — suddenly plausible even for mobile devices.
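To make that arithmetic concrete, here is a tiny back-of-the-envelope sketch (plain Python, no dependencies) that reproduces the numbers above; the parameter counts are the publicly reported ones, and the footprint covers weights only (no activations or KV cache).

```python
def model_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight-storage footprint in GB (weights only)."""
    return num_params * bits_per_weight / 8 / 1e9

for name, params in [("GPT-3", 175e9), ("LLaMA-2-7B", 7e9)]:
    for bits in (16, 8, 4):
        print(f"{name}: {bits:>2}-bit -> {model_memory_gb(params, bits):7.1f} GB")

# GPT-3:      16-bit -> 350.0 GB, 8-bit -> 175.0 GB, 4-bit -> 87.5 GB
# LLaMA-2-7B: 16-bit ->  14.0 GB, 8-bit ->   7.0 GB, 4-bit ->  3.5 GB
```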

But here's the catch: naively converting 16-bit weights to 4-bit integers (called "round-to-nearest" or RTN quantization) causes dramatic accuracy loss. The model starts generating nonsense. This paper, AWQ (Activation-aware Weight Quantization), solves this problem elegantly — and it was awarded Best Paper at MLSys 2024.


2. Prerequisites: What You Need to Know First

This section explains every concept from scratch. If you're a complete beginner, start here. If you know machine learning basics, you can skim to Section 3.

2.1 Neural Network Weights and Matrix Multiplication

A large language model like LLaMA or GPT-2 is, at its mathematical core, a very long sequence of matrix multiplications. The most fundamental building block is:

y = Wx

Here:

  • x is the input vector — a list of numbers representing what the model is currently processing (e.g., the meaning of a word encoded as 4096 numbers)
  • W is the weight matrix — a large grid of numbers that the model learned during training. A typical weight matrix in LLaMA-2-7B might be 4096 × 4096 ≈ 16.8 million numbers.
  • y is the output vector — the result of the computation

The model has thousands of these weight matrices. Their collective values encode the model's "knowledge" about language, facts, and reasoning. The total number of values across all weight matrices is the model's parameter count — 7 billion for LLaMA-2-7B.

Why this matters for quantization: Everything we do to compress the model amounts to replacing the original W (stored in FP16) with a compressed version Ŵ (stored in INT4) while keeping y = Ŵx as close to y = Wx as possible.

2.2 Floating-Point Numbers: FP32, FP16, and Why They Matter

Computers represent real numbers approximately using floating-point format. Think of scientific notation: 3.14 × 10² = 314. Floating-point numbers have:

  • A sign (positive or negative)
  • A mantissa (the significant digits, like 3.14)
  • An exponent (the power of 2 or 10, like 2 above)

Different precisions allocate different numbers of bits:

| Format | Total Bits | Range | Precision |
|--------|------------|-------|-----------|
| FP32 | 32 | ±3.4×10³⁸ | ~7 decimal digits |
| FP16 | 16 | ±65,504 | ~3 decimal digits |
| BF16 | 16 | ±3.4×10³⁸ | ~2 decimal digits |
| INT8 | 8 | −128 to 127 | exact integers only |
| INT4 | 4 | −8 to 7 | exact integers only |

Modern LLMs are typically trained in FP32 or mixed precision but deployed in FP16 (a 2× memory reduction with minimal accuracy loss). Quantization further reduces weights to INT8 or INT4, giving 4-8× compression relative to FP16.

The key challenge: FP16 can represent about 65,000 distinct values in its range, while INT4 can represent only 16 distinct values (2⁴ = 16). Squeezing the rich information of a 16-bit weight into 4 bits inevitably loses precision — the question is how much.

2.3 What Is Quantization?

Quantization is the process of mapping continuous (or high-precision) values to a discrete set of lower-precision values.

The standard formula for uniform quantization is:

Q(w) = \Delta \cdot \text{Round}\!\left(\frac{w}{\Delta}\right), \qquad \Delta = \frac{\max(|w|)}{2^{N-1}}

Where:

  • w is the original weight value (in FP16)
  • N is the number of quantization bits (e.g., N = 4 for INT4)
  • Δ is the quantization scale — the step size between quantization levels
  • Round(·) rounds to the nearest integer
  • Q(w) is the quantized approximation

Let's trace through a concrete example. Suppose N = 4 (4-bit), so we have 2⁴ = 16 possible integer values (−8 to 7). If the maximum absolute weight value is max(|w|) = 2.0, then:

\Delta = \frac{2.0}{2^{3}} = \frac{2.0}{8} = 0.25

A weight w = 1.7 would be quantized to:

Q(1.7) = 0.25 \cdot \text{Round}\!\left(\frac{1.7}{0.25}\right) = 0.25 \cdot \text{Round}(6.8) = 0.25 \times 7 = 1.75

The quantization error for this weight is |1.75 − 1.70| = 0.05 — small on its own, but accumulated across billions of weights, these errors compound into significant model degradation.
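You can verify the worked example with a few lines of Python. This is a minimal sketch of per-tensor symmetric RTN (the grouped variant used in the experiments splits the tensor into groups first, but the per-group formula is the same):

```python
import numpy as np

def rtn_quantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest: Q(w) = delta * Round(w / delta)."""
    delta = np.abs(w).max() / 2 ** (n_bits - 1)                    # step size
    q = np.clip(np.round(w / delta), -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1)
    return delta * q

w = np.array([1.7, -0.3, 0.05, 2.0])
print(rtn_quantize(w))   # [ 1.75 -0.25  0.    1.75]  with delta = 0.25
# Note: the maximum element (2.0) itself clips from level +8 to +7, i.e. 1.75.
```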

2.4 The Quantization Error Problem

The challenge of low-bit quantization is best understood through perplexity comparison. For the OPT-6.7B model (a popular open-source LLM):

| Method | Bits | WikiText Perplexity |
|--------|------|---------------------|
| FP16 (full precision) | 16 | 10.86 |
| RTN (round-to-nearest) | INT3, g128 | 23.54 |
| AWQ | INT3, g128 | 11.39 |

RTN at 3-bit quantization more than doubles the perplexity — meaning the model generates far less coherent text. AWQ brings it back almost to FP16 quality. How?

The key insight is that rounding errors are not equally harmful. Some weights, when corrupted by quantization, cause catastrophic degradation. Others can be quantized aggressively with minimal effect. The art is in identifying and protecting the critical ones.

2.5 Post-Training Quantization vs. Quantization-Aware Training

There are two main families of quantization approaches:

Quantization-Aware Training (QAT): During training, simulate low-bit arithmetic. The model learns to be robust to quantization errors. Works well but requires the full training process — for LLaMA-2 with 7B parameters trained on 2 trillion tokens, this means weeks on thousands of GPUs. Completely impractical for post-deployment compression.

Post-Training Quantization (PTQ): Take an already-trained model and compress it using a small calibration dataset (typically 128-512 text samples). No retraining. Much cheaper, but you have less ability to correct quantization errors.

AWQ is a post-training quantization method. It uses only a small calibration dataset to measure statistics (specifically, the average magnitude of activations per channel) — no gradient computation, no backpropagation. This makes it extremely fast and generalizable.

2.6 Weight-Only vs. Weight+Activation Quantization

Another important distinction:

W8A8 (Weight 8-bit, Activation 8-bit): Both weights and activations are quantized to INT8. Computations happen in INT8. Hardware can exploit fast INT8 matrix multiplication units. Example method: SmoothQuant.

W4A16 (Weight 4-bit, Activation 16-bit): Only weights are stored as INT4; activations remain FP16. At inference time, weights are dequantized back to FP16 on the fly before the matrix multiplication.

Why would W4A16 make sense? Look at the arithmetic:

  • Weight memory: 7B parameters × 4 bits = 3.5 GB (vs. 14 GB for FP16)
  • Activation memory: Much smaller (proportional to batch size × sequence length × hidden dim)

For on-device inference with batch size 1 (single user), weight loading from memory dominates. The generation stage is memory-bandwidth-bound — the bottleneck is how fast you can load weights from DRAM, not how fast you can compute. By quantizing weights to 4 bits, you reduce memory traffic by 4×, leading to nearly 4× speedup.

AWQ focuses on W4A16 quantization — exactly the right choice for on-device LLM deployment.
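A rough bandwidth model makes the "memory-bound" claim concrete: at batch size 1, every generated token must stream the entire weight set from DRAM, so an upper bound on decoding speed is simply memory bandwidth divided by model size. The bandwidth figure below is an illustrative assumption, not a measurement:

```python
def peak_tokens_per_sec(model_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound for memory-bandwidth-bound, batch-1 token generation."""
    return bandwidth_gb_per_s * 1e9 / model_bytes

bandwidth = 200.0                 # GB/s, hypothetical mobile-class GPU/SoC
fp16_bytes = 7e9 * 2.0            # LLaMA-2-7B weights in FP16
int4_bytes = 7e9 * 0.5            # the same weights at 4 bits

print(peak_tokens_per_sec(fp16_bytes, bandwidth))   # ~14 tokens/s ceiling
print(peak_tokens_per_sec(int4_bytes, bandwidth))   # ~57 tokens/s ceiling (4x higher)
```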

2.7 Perplexity: How We Measure Language Model Quality

Perplexity is the standard metric for evaluating language model quality on text generation tasks. Intuitively, it measures how "surprised" or "uncertain" the model is when it sees real text.

\text{Perplexity} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1, \ldots, w_{i-1})\right)

  • Lower perplexity = better model (the model is less surprised, more confident)
  • A perfect model that always predicts the next word correctly would have perplexity = 1
  • A model that randomly guesses among 50,000 words would have perplexity ≈ 50,000
  • Good LLMs typically have perplexity 5-15 on WikiText-2

Think of perplexity as an inverse measure of fluency. A model with perplexity 11 reads much more naturally than one with perplexity 23. In the AWQ experiments, the goal is to achieve perplexity as close as possible to the FP16 baseline (e.g., 10.86 for OPT-6.7B) even after aggressive 3-bit or 4-bit quantization.
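The following sketch computes perplexity exactly as defined above, given a model's next-token logits; the toy check at the end confirms the "uniform guessing over 50,000 words ≈ perplexity 50,000" intuition:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """logits: [seq_len, vocab] next-token predictions; targets: [seq_len] true token ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_log_probs = log_probs[torch.arange(targets.numel()), targets]
    return torch.exp(-token_log_probs.mean()).item()

vocab = 50_000
logits = torch.zeros(10, vocab)              # uniform distribution over the vocabulary
targets = torch.randint(0, vocab, (10,))
print(perplexity(logits, targets))           # ~50000.0
```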

2.8 Group Quantization and Per-Channel Scaling

Group quantization divides each weight row into small groups (e.g., 128 weights per group) and quantizes each group independently with its own scale Δ. This is standard in modern LLM quantization:

  • Notation: "INT3-g128" means 3-bit quantization with group size 128
  • Smaller groups → more scale parameters stored → better accuracy but slightly higher overhead
  • Group size 128 is the standard sweet spot

Per-channel scaling is a related concept: instead of using a single scale for an entire weight matrix, use one scale per output channel (one per row of W). AWQ goes further by applying a per-input-channel scaling factor s that is embedded into the weights before quantization.
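A brief sketch of group quantization on a single weight row, under the same symmetric-RTN assumption as before; it also shows why "INT3-g128" costs about 3.125 effective bits per weight once the per-group FP16 scales are counted:

```python
import numpy as np

def group_rtn(row: np.ndarray, n_bits: int = 3, group_size: int = 128):
    """Quantize one weight row group-by-group, each group with its own scale."""
    groups = row.reshape(-1, group_size)
    deltas = np.abs(groups).max(axis=1, keepdims=True) / 2 ** (n_bits - 1)
    q = np.clip(np.round(groups / deltas), -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1)
    return (q * deltas).reshape(-1), deltas.squeeze()

row = np.random.randn(4096).astype(np.float32)
dequantized, scales = group_rtn(row)
effective_bits = 3 + 16 / 128            # 3-bit weights + one FP16 scale per 128 weights
print(len(scales), effective_bits)       # 32 groups, 3.125 bits per weight
```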


3. What AWQ Does: The Big Picture

AWQ makes one beautiful observation and builds an entire system around it:

Not all weights in an LLM are equally important. Protecting just 1% of them — the "salient" ones — dramatically reduces quantization error. And crucially, we can identify these salient weights not by looking at the weights themselves, but by looking at the activations that flow through them.

The naive approach to protecting important weights: keep them in FP16 while quantizing the rest to INT4. This is "mixed precision" quantization. It works (perplexity drops from 23.54 to 13.00 for OPT-6.7B at INT3), but has a major problem: hardware hates mixed precision. Real accelerators (GPUs, NPUs) are optimized for uniform data types. A weight matrix with 99% INT4 values and 1% FP16 values is awkward to work with and doesn't yield practical speedups.

AWQ's solution: instead of keeping salient weights in FP16, scale them up before quantization and scale the corresponding activations down. This is a mathematical equivalence transformation that:

  1. Reduces the quantization error for salient weights
  2. Keeps everything in INT4 (no mixed precision)
  3. Incurs zero extra computation cost at inference time

The scaling factor is determined by a simple grid search over activation statistics collected from a small calibration dataset — no backpropagation required.


4. The Core Insight: Not All Weights Are Equal

This section explains the crucial observation that makes AWQ work.

The experiment

The authors quantized OPT models using INT3-g128 and asked: what happens if we keep a tiny fraction of weight channels in FP16 instead of quantizing them?

| Method | FP16 kept | WikiText PPL (OPT-6.7B) |
|--------|-----------|-------------------------|
| RTN (all quantized) | 0% | 23.54 |
| Keep top 1% by weight magnitude | 1% | 22.37 |
| Keep top 1% by activation magnitude | 1% | 11.39 |
| Keep 1% randomly | 1% | 23.54 |

Three findings jump out:

  1. Keeping 1% of weights in FP16 based on activation magnitude reduces perplexity from 23.54 to 11.39 — nearly recovering FP16 quality (10.86).
  2. Keeping 1% based on weight magnitude (the naive choice) barely helps: 22.37 vs. 23.54.
  3. Random selection does nothing.

Why does activation magnitude identify important weights?

A weight w_ij in matrix W connects input channel j to output channel i. The "importance" of this weight to the final model output depends not just on w_ij itself, but on how large the corresponding input x_j typically is. If x_j is typically large, then even a small quantization error in w_ij gets amplified: the error it contributes to the output is (error in w_ij) · x_j, which grows with |x_j|.

So we should look at input channel j's activation magnitude |x_j| to judge whether the weights in that channel are important. Weight channels corresponding to large activations are more "salient" — their quantization errors have a larger effect on the output.

This is the activation-awareness of AWQ: using input activation statistics to guide weight quantization decisions.
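The selection rule behind the table above is easy to express in code. This is a sketch of the mixed-precision baseline that motivates AWQ (shapes and the calibration tensor are hypothetical): rank input channels by average activation magnitude over a calibration batch and keep the top 1% in FP16.

```python
import torch

def salient_channels(X: torch.Tensor, frac: float = 0.01) -> torch.Tensor:
    """X: calibration activations [num_tokens, in_features]; returns salient channel indices."""
    channel_magnitude = X.abs().mean(dim=0)          # average |x_j| per input channel
    k = max(1, int(frac * channel_magnitude.numel()))
    return channel_magnitude.topk(k).indices

# Synthetic calibration batch in which some channels are systematically larger
X = torch.randn(2048, 4096) * (torch.rand(4096) * 5)
idx = salient_channels(X)
print(idx.shape)                                     # 40 channels out of 4096 kept in FP16
```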


5. AWQ Method Details

5.1 Analyzing Quantization Error Mathematically

AWQ starts with a rigorous analysis. For a weight-only quantization setting, the linear layer computes:

y = \mathbf{W}\mathbf{x}

After quantization, the weight becomes Q(w) (where w is a single weight value from W), and its contribution to the output becomes Q(w) · x.

Recall the quantization function:

Q(w) = \Delta \cdot \text{Round}\!\left(\frac{w}{\Delta}\right), \qquad \Delta = \frac{\max(|w|)}{2^{N-1}}

The quantization error for a single element is:

\text{Err}\big(Q(w) \cdot x\big) = \Delta \cdot \text{RoundErr}\!\left(\frac{w}{\Delta}\right) \cdot x \tag{3}

where RoundErr(·) = Round(·) − (·) is the rounding residual, roughly uniformly distributed in [−0.5, 0.5], giving an expected absolute value of 0.25.

Note two things:

  • The error is proportional to x — larger activations amplify quantization error
  • The error is proportional to Δ — larger scale factors amplify error

5.2 The Activation-Aware Scaling Trick

Now suppose we multiply weight w by a scaling factor s > 1 before quantization and divide the activation x by s to compensate:

Q(w \cdot s) \cdot \frac{x}{s}

This is mathematically equivalent to Q(w) · x in the absence of quantization error, but it behaves differently under quantization. The crucial question is what happens to the scale. Δ is computed over an entire quantization group, while s is applied only to the salient input channel, so scaling one channel up rarely changes the group maximum:

\Delta' = \frac{\max(|w \cdot s|)}{2^{N-1}} \approx \Delta

(valid as long as s is not so large that a scaled weight becomes the new maximum of its group).

The quantization error with scaling is:

\text{Err}\!\left(Q(w \cdot s) \cdot \frac{x}{s}\right) = \Delta' \cdot \text{RoundErr}\!\left(\frac{w s}{\Delta'}\right) \cdot x \cdot \frac{1}{s} \tag{3b}

Compare with Equation (3). The ratio of the new error to the old error is:

\frac{\Delta'}{\Delta} \cdot \frac{1}{s}

Two observations make this ratio favorable for the salient channels:

  • RoundErr(ws/Δ') is still just a rounding residual, roughly uniform in [−0.5, 0.5], so its expected magnitude (0.25) does not grow after scaling
  • Since Δ' ≈ Δ (the group maximum rarely changes) and s > 1, the ratio is approximately 1/s < 1 — the error shrinks by a factor of s for exactly the channels where x is large
Empirically, the authors find that for s = 2, the perplexity of OPT-6.7B under INT3 drops from 23.54 (no scaling) to 11.92 — nearly matching the 11.39 of the mixed-precision (1% FP16) baseline (Table 2 in the paper).
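A toy Monte-Carlo check of Equations (3) and (3b): quantize a 128-weight group with and without scaling a single salient channel by s = 2, and compare the average output error. The distributions and the 30× activation boost below are made-up numbers; only the trend matters.

```python
import numpy as np

rng = np.random.default_rng(0)

def rtn(w, n_bits=3):
    delta = np.abs(w).max() / 2 ** (n_bits - 1)
    q = np.clip(np.round(w / delta), -2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1)
    return delta * q

plain_errs, scaled_errs = [], []
for _ in range(500):
    w = rng.normal(size=128)              # one quantization group (128 input channels)
    x = rng.normal(size=128)
    x[0] *= 30.0                          # channel 0 is "salient": large activation
    s = np.ones(128); s[0] = 2.0          # scale only the salient channel, as in AWQ

    ref = w @ x
    plain_errs.append(abs(rtn(w) @ x - ref))
    scaled_errs.append(abs(rtn(w * s) @ (x / s) - ref))

print(np.mean(plain_errs), np.mean(scaled_errs))   # the scaled error is clearly smaller on average
```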

Why doesn't scaling harm non-salient channels?

If s is too large, the scaled weight can become the new group maximum, increasing the group scale Δ and making quantization coarser for all weights in the group, including non-salient ones. The authors observe that at s = 4, 21.2% of weights fall into a different quantization bin (the rounding goes to a different integer), which increases error for non-salient channels. The optimal s ≈ 2 protects salient channels without noticeably harming the others.

5.3 Searching for the Optimal Scale

Rather than hand-tuning a scale for each channel, AWQ defines an automatic search procedure.

The objective: Find per-channel scales s that minimize the output reconstruction error after quantization:

\mathbf{s}^* = \arg\min_{\mathbf{s}} \mathcal{L}(\mathbf{s}), \qquad \mathcal{L}(\mathbf{s}) = \left\| Q\big(\mathbf{W} \cdot \text{diag}(\mathbf{s})\big)\big(\text{diag}(\mathbf{s})^{-1} \cdot \mathbf{X}\big) - \mathbf{W}\mathbf{X} \right\| \tag{4}

Here:

  • W is the original weight matrix in FP16
  • Q(·) is the quantization function (INT3 or INT4)
  • X is a small set of input activations collected from the calibration dataset
  • diag(s) is a diagonal matrix of per-channel scales
  • The scales multiply the weight columns (input channels of W) and inversely divide the corresponding rows of X

Since Q(·) is not differentiable (it involves rounding), gradient-based optimization is unreliable. Instead, AWQ uses a simple grid search over a compact one-dimensional search space.

The search space: Since the salient channels are identified by large activation magnitude, AWQ uses the average activation magnitude per input channel, s_X, as a reference and searches for the best exponent α:

\mathbf{s} = \mathbf{s}_{\mathbf{X}}^{\alpha}, \qquad \alpha^* = \arg\min_{\alpha} \mathcal{L}\big(\mathbf{s}_{\mathbf{X}}^{\alpha}\big) \tag{5}

  • α = 0: no scaling (all channels get scale 1) — equivalent to RTN
  • α = 1: scaling fully proportional to activation magnitude — the most aggressive protection of salient channels
  • The optimal α typically falls around 0.5, balancing salient and non-salient channels

The grid search over α ∈ [0, 1] with step size 0.05 (a grid of 20 values) is extremely fast — it requires only forward passes through the quantized layer, no backpropagation. This is why AWQ runs without any GPU training infrastructure.

Additional trick: weight clipping

The authors also apply weight clipping to minimize MSE quantization error, but this is a secondary optimization on top of the main scaling approach.

How the scaling is absorbed into inference

At inference time, the factor s⁻¹ that should multiply the activations is instead fused into the preceding layer. For example, if the preceding layer has a bias or normalization, the division by s can be absorbed into its parameters. This means inference runs with zero overhead from the scaling operation — the quantized weights are simply stored with the scale baked in.

This equivalence transformation is the key to AWQ being hardware-friendly: everything remains in INT4, no mixed precision, no extra operations.
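A sketch of that absorption for the common LayerNorm → Linear pattern (layer sizes and scale values here are placeholders): scaling the Linear layer's input columns by s while dividing the LayerNorm's affine parameters by s leaves the composite function unchanged, so nothing extra runs per token.

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(4096)
linear = nn.Linear(4096, 4096, bias=False)
s = torch.rand(4096) + 0.5               # per-input-channel AWQ scales (placeholder values)

x = torch.randn(8, 4096)
ref = linear(ln(x))                      # original computation

# AWQ equivalence transform: the weight absorbs s, the preceding LayerNorm absorbs 1/s.
linear.weight.data *= s                  # scale the columns (input channels) of W
ln.weight.data /= s                      # gamma / s
ln.bias.data /= s                        # beta / s

out = linear(ln(x))                      # same output, no additional runtime operation
print(torch.allclose(ref, out, atol=1e-5))   # True
```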


6. TinyChat: Deploying AWQ on Real Devices

Having a compression algorithm is only half the story. The other half is actually running fast inference on real hardware. The AWQ paper introduces TinyChat, a specialized inference system built to realize the theoretical speed gains from W4A16 quantization.

Why Converting Memory to Speed Is Non-Trivial

W4A16 quantization saves memory (4× reduction in weight footprint) but doesn't automatically translate to 4× speed. The challenge: modern hardware doesn't natively multiply INT4 weights by FP16 activations. You need to:

  1. Load INT4 weights from memory (fast — only 4 bits per weight)
  2. Dequantize them to FP16 on-the-fly
  3. Multiply FP16 weights by FP16 activations

The dequantization step is costly if done naively. TinyChat addresses this with two key techniques:

On-the-Fly Weight Dequantization with Kernel Fusion

Rather than dequantizing weights to DRAM (which would negate memory savings), TinyChat fuses the dequantization directly into the matrix multiplication kernel. The kernel reads INT4 weights from memory, converts them to FP16 registers, and immediately performs the multiply-accumulate — all without touching DRAM for the dequantized weights.

For GPU inference, AWQ kernels are implemented in CUDA/PTX; for CPU/ARM, they use NEON SIMD instructions.

SIMD-Aware Weight Packing (ARM NEON)

On ARM CPUs (phones, Apple Silicon, Raspberry Pi), SIMD (Single Instruction Multiple Data) registers process multiple values simultaneously. ARM NEON has 128-bit registers.

The challenge: 4-bit weights don't align naturally to byte boundaries. TinyChat uses a reordering and packing strategy (a toy illustration of the mask-and-shift unpacking follows the list):

  • Each 128-bit NEON register holds 32 4-bit weights
  • Weights are reordered offline so that a single 128-bit AND operation followed by a shift can unpack all 32 weights at once
  • This requires just 3 SIMD instructions to unpack 32 weights, versus roughly 3 instructions per weight in a naive scalar implementation
  • Results in up to 1.2× speedup for the packing/unpacking alone
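Below is a toy NumPy stand-in for the pack/unpack idea. The real TinyChat kernel reorders weights offline so that one 128-bit AND plus one shift unpacks a whole NEON register; the layout here is simplified to two weights per byte, but the mask-and-shift trick is the same.

```python
import numpy as np

w_int4 = np.random.randint(0, 16, size=32, dtype=np.uint8)       # 32 unsigned 4-bit weights

# Pack two 4-bit values per byte: 32 nibbles -> 16 bytes (one 128-bit register's worth)
packed = (w_int4[0::2] | (w_int4[1::2] << 4)).astype(np.uint8)

low = packed & 0x0F             # one AND recovers the even-indexed weights
high = packed >> 4              # one shift recovers the odd-indexed weights
unpacked = np.empty(32, dtype=np.uint8)
unpacked[0::2], unpacked[1::2] = low, high

assert np.array_equal(unpacked, w_int4)
```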

Kernel Fusion

TinyChat fuses multiple operations into single CUDA/NEON kernels:

  • Layer normalization: Multiplication, division, and square root fused into one kernel
  • Attention layers: QKV projections and positional encoding fused; KV cache pre-allocated and updated inside the attention kernel
  • Benefits: Reduces kernel launch overhead and intermediate DRAM accesses

On an NVIDIA RTX 4090, each FP16 kernel launch takes roughly 0.01 ms. For models like Falcon and StarCoder with many short sequential operations, fusing kernels eliminates much of this launch overhead and gives substantial speedups.


7. Experiments and Results

7.1 Setup and Baselines

Quantization setting: Weight-only grouped quantization (INT4 and INT3 with group size 128). This is denoted as W4A16 or W3A16 — 4 or 3-bit weights, 16-bit activations.

Models evaluated:

  • LLaMA family: 7B, 13B, 30B, 65B
  • Llama-2 family: 7B, 13B, 70B
  • OPT family: 1.3B, 2.7B, 6.7B, 13B, 30B
  • Mistral-7B, Mixtral-8x7B (mixture of experts)
  • Instruction-tuned: Vicuna-7B, Vicuna-13B
  • Multi-modal: OpenFlamingo-9B, LLaVA-13B, VILA-7B, VILA-13B

Baselines:

  • RTN (Round-to-Nearest): Simplest baseline, quantizes each group independently
  • GPTQ: State-of-the-art PTQ using second-order (Hessian) information for error correction
  • GPTQ-R: GPTQ with reordering trick for better performance

Calibration data: Pile dataset (Gao et al., 2020); grid size of 20 for the α search.

7.2 Language Model Results

Table 4: Llama-2 and LLaMA perplexity (WikiText-2)

| Model | FP16 | RTN (INT3) | GPTQ (INT3) | AWQ (INT3) |
|-------|------|------------|-------------|------------|
| Llama-2-7B | 5.47 | 6.66 | 6.43 | 6.24 |
| Llama-2-13B | 4.88 | 5.52 | 5.48 | 5.32 |
| Llama-2-70B | 3.32 | 3.98 | 3.88 | 3.74 |
| LLaMA-7B | 5.68 | 7.01 | 8.81 | 6.35 |
| LLaMA-13B | 5.09 | 5.88 | 5.66 | 5.52 |
| LLaMA-30B | 4.10 | 4.88 | 4.88 | 4.61 |
| LLaMA-65B | 3.53 | 4.24 | 4.17 | 3.95 |

Key observations:

  1. AWQ consistently beats both RTN and GPTQ at INT3 quantization
  2. At INT4, the gains are smaller (all methods work reasonably well at 4-bit) but AWQ still achieves the best perplexity
  3. For LLaMA-7B at INT3, GPTQ dramatically underperforms (8.81 PPL) while AWQ achieves 6.35 — this is because GPTQ overfits to the calibration set, hurting generalization

OPT models show similar trends. AWQ at INT3-g128 achieves:

  • OPT-1.3B: 16.32 PPL (vs. FP16: 14.62, RTN: 119.47)
  • OPT-6.7B: 11.39 PPL (vs. FP16: 10.86, RTN: 23.54)
  • OPT-13B: 10.56 PPL (vs. FP16: 10.13, RTN: 46.04)

7.3 Instruction-Tuned Models (Vicuna)

Instruction-tuned models (fine-tuned to follow human instructions) are harder to quantize because fine-tuning changes the weight distribution in subtle ways. The authors evaluate using GPT-4 as a judge on 80 questions.

Figure 5 results for Vicuna-7B (INT3-g128, GPT-4 evaluation):

  • vs. the FP16 counterpart: AWQ wins 23 questions, ties 5, and loses 52
  • vs. RTN: AWQ wins 6 more questions and ties 3 more
  • vs. GPTQ: AWQ wins 11 more questions

The "wins" represent questions where the quantized model gives better-judged answers than FP16 (which can happen since quantization can act as slight regularization). More importantly, AWQ shows fewer losses compared to RTN and GPTQ.

7.4 Multi-Modal Models (OpenFlamingo, LLaVA, VILA)

This is a first for LLM quantization research — applying W4A16 quantization to vision-language models.

OpenFlamingo-9B on COCO Captioning (CIDEr↑):

| Method | INT4-g128 | INT3-g128 |
|--------|-----------|-----------|
| FP16 (32-shot) | 81.70 | — |
| RTN | 77.13 | 64.79 |
| GPTQ | 74.98 | 64.77 |
| AWQ | 80.53 | 74.47 |

AWQ reduces the quantization degradation from 4.57 to just 1.17 CIDEr points at INT4 — a 4× model size reduction with negligible performance loss.

VILA-7B and VILA-13B on 11 visual-language benchmarks (Table 7): AWQ consistently shows lossless or near-lossless quantization performance across all 11 benchmarks including VQAv2, GQA, VizWiz, SQA, POPE, MME, MMB, SEED, LLaVA-bench, and MM-Vet.

The success on multi-modal models demonstrates that AWQ's activation-based salient weight detection generalizes across modalities — the same calibration strategy works whether the model processes text or images.

7.5 Coding and Math Benchmarks

CodeLlama-7B-Instruct on MBPP (programming) and Llama-2 models on GSM8K (math), all at INT4-g128:

| Benchmark (model) | FP16 | RTN | GPTQ | AWQ |
|-------------------|------|-----|------|-----|
| MBPP pass@1 (CodeLlama-7B-Instruct) | 38.53 | 37.51 | 31.97 | 40.64 |
| MBPP pass@10 (CodeLlama-7B-Instruct) | 49.77 | 48.49 | 44.75 | 49.25 |
| GSM8K (Llama-2-7B) | 13.87 | 11.07 | 12.13 | 13.57 |
| GSM8K (Llama-2-13B) | 26.16 | 21.23 | 24.26 | 25.12 |
| GSM8K (Llama-2-70B) | 56.41 | 53.98 | 56.03 | 56.40 |

Remarkably, AWQ on MBPP exceeds FP16 performance (40.64 vs. 38.53 pass@1) — quantization as slight regularization can sometimes improve task-specific performance. For GSM8K, AWQ at INT4 nearly matches FP16 performance across all model scales.

7.6 Speedup Results (TinyChat)

RTX 4090 (desktop GPU) — tokens/sec:

| Model | FP16 (Huggingface) | AWQ + TinyChat |
|-------|--------------------|----------------|
| Llama-2-7B | 62 | 194 (3.1×) |
| Llama-2-13B | 52 | 110 (2.1×) |
| MPT-7B | 85 | 158 (1.9×) |
| MPT-30B | OOM | 49 |
| Falcon-7B | 53 | 53 (1.0×) |

Jetson Orin (mobile GPU) — tokens/sec:

| Model | FP16 | AWQ + TinyChat |
|-------|------|----------------|
| Llama-2-7B | 12 | 39 (3.3×) |
| Llama-2-13B | 9 | 25 (2.8×) |
| MPT-30B | OOM | 12 |
| Falcon-7B | 12 | 38 (3.2×) |

Key results:

  • Up to roughly 3× speedup over the FP16 Huggingface implementation on both desktop and mobile GPUs
  • Enables models that couldn't fit in memory (MPT-30B = OOM in FP16) to run at interactive speeds
  • Llama-2-70B: deployable on a single NVIDIA Jetson Orin with 64GB memory

VILA visual-language models on multiple platforms (Table 10):

  • VILA-7B: 155.3 tokens/sec (4090) vs. 81.6 (FP16) — 1.9× speedup
  • VILA-13B: 102.1 tokens/sec (4090) — runs where FP16 is OOM

8. Limitations and Open Questions

AWQ is an excellent method, but it has several limitations worth understanding:

8.1 Still Requires Calibration Data

AWQ needs a small calibration dataset to measure average activation magnitudes per channel. While the dataset is small (16 sequences of 2048 tokens, roughly 10× fewer than GPTQ's 192 sequences), it still:

  • Requires access to representative data for the target domain
  • Introduces a mild dependence on calibration distribution

The authors show AWQ is more robust to calibration distribution mismatch than GPTQ (Figure 8b — PPL increase of only 0.5-0.6 vs. GPTQ's 2.3-4.9 when calibration and evaluation domains don't match). But this robustness is a matter of degree, not elimination.

8.2 Simplified Search Space

The scaling search uses a single parameter α per layer, with s = s_X^α. This constrains the search to a 1D curve in the space of all possible per-channel scales. A more expressive search might find better scales, at the cost of more computation.

8.3 Extreme Low-Bit Quantization (INT2)

At very aggressive 2-bit quantization (INT2), AWQ degrades significantly and GPTQ's reconstruction-based approach helps. The authors show that combining AWQ + GPTQ (Table 9) achieves the best INT2 results — AWQ alone is insufficient here.

8.4 Generative Tasks vs. Discriminative Tasks

AWQ is evaluated primarily on generative tasks (text generation, captioning). For classification or embedding tasks where the model architecture and use pattern differ, the activation magnitude heuristic may be less valid.

8.5 Quantization of Certain Architectures

Some architectures (e.g., attention layers with specific normalization strategies) may have different activation distribution patterns. The authors tested Mistral and Mixtral (Mixture-of-Experts) and found AWQ generalizes well, but the scaling heuristic could theoretically fail for exotic architectures.

8.6 No Formal Convergence Guarantee

The search in Equation (5) is a heuristic grid search. There is no theoretical guarantee that the α* found by the search is globally optimal. The 1D grid search is practical and effective empirically, but a more rigorous optimization (with accompanying theory) would strengthen the method.


9. Reproducibility and Code

AWQ is among the most practically impactful quantization methods and has excellent open-source support.

Official Implementation: https://github.com/mit-han-lab/llm-awq

TinyChat: Available within the same repository, supporting CUDA, ARM NEON, and Apple Metal backends.

Ecosystem Adoption: AWQ has been integrated into virtually every major LLM serving framework:

  • HuggingFace Transformers (AutoAWQForCausalLM)
  • NVIDIA TensorRT-LLM
  • Microsoft DirectML
  • Google Vertex AI
  • Intel Neural Compressor
  • Amazon SageMaker
  • vLLM, LMDeploy, FastChat

Reproducing the key experiment (Table 3 from the paper — OPT-6.7B INT3-g128):

```bash
pip install awq
python -m awq.quantize \
    --model facebook/opt-6.7b \
    --bits 3 \
    --group-size 128 \
    --calib-data pile \
    --output opt-6.7b-awq-int3
```

Expected result: WikiText-2 PPL ≈ 11.39 (vs. RTN's 23.54, FP16's 10.86)

Minimal implementation of the core AWQ scaling search (a runnable PyTorch-style sketch; the group-wise RTN helper follows the formula from Section 2.3):

```python
import torch


def quantize(W: torch.Tensor, bits: int, group_size: int) -> torch.Tensor:
    # Group-wise symmetric round-to-nearest quantization (see Section 2.3):
    # each group of `group_size` weights along the input dimension gets its own scale.
    out_features, in_features = W.shape
    Wg = W.reshape(out_features, in_features // group_size, group_size)
    delta = Wg.abs().amax(dim=-1, keepdim=True) / (2 ** (bits - 1))
    q = (Wg / delta).round().clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return (q * delta).reshape(out_features, in_features)


def awq_quantize_layer(W, X, bits=4, group_size=128, grid_size=20):
    # W: weight matrix [out_features, in_features]
    # X: calibration activations [in_features, num_samples]

    # Per-input-channel activation magnitude (the "activation-awareness")
    s_X = X.abs().mean(dim=1)                        # [in_features]

    best_alpha, best_loss = 0.0, float('inf')

    # Grid search over alpha in [0, 1] (Equation 5) -- forward passes only, no gradients
    for alpha in [i / grid_size for i in range(grid_size + 1)]:
        s = s_X.pow(alpha).clamp(min=1e-4)           # per-channel scale [in_features]

        # Scale weight columns up, scale activations down (equivalence transform)
        W_scaled = W * s.unsqueeze(0)                # [out, in] * [1, in] -> [out, in]
        X_scaled = X / s.unsqueeze(1)                # [in, samples] / [in, 1]

        # Quantize the scaled weights
        W_q = quantize(W_scaled, bits, group_size)

        # Output reconstruction error (Equation 4)
        loss = (W_q @ X_scaled - W @ X).norm().item()

        if loss < best_loss:
            best_loss, best_alpha = loss, alpha

    # Re-apply the optimal scaling and quantize for good
    s_opt = s_X.pow(best_alpha).clamp(min=1e-4)
    W_awq = quantize(W * s_opt.unsqueeze(0), bits, group_size)

    # At inference, s_opt^{-1} is absorbed into the preceding layer (e.g., LayerNorm),
    # so no extra work is done per token.
    return W_awq, s_opt
```

10. Connection to the Bigger Picture: Compression Landscape

AWQ sits in the broader landscape of LLM compression, alongside methods that target different axes:

10.1 Relation to GPTQ

Both AWQ and GPTQ are post-training quantization methods targeting the same problem. Their key differences:

| Aspect | AWQ | GPTQ |
|--------|-----|------|
| Core principle | Activation-aware scaling | Hessian-based error correction |
| Calibration data | 16 sequences | 192 sequences |
| Backpropagation | No | No (but uses Hessian information) |
| Overfitting risk | Low | Higher (Table 4, LLaMA-7B INT3) |
| Orthogonality | Yes — can combine | Yes — can combine |

The authors show in Table 9 that AWQ + GPTQ combined outperforms either alone at extreme low-bit (INT2) quantization. They are complementary, not competing.

10.2 Relation to SVD-Based Compression

AWQ is a weight-space compression method. An alternative approach is low-rank decomposition via SVD (Singular Value Decomposition):

Given a weight matrix W ∈ ℝ^{m×n}, compute:

\mathbf{W} \approx \mathbf{U}_k \mathbf{\Sigma}_k \mathbf{V}_k^{T}

where k ≪ min(m, n) keeps only the dominant singular values. Methods like AdaLoRA and ASVD (activation-aware SVD) use this principle.

The connection to AWQ: ASVD uses activation statistics similarly to AWQ — scaling the rows and columns of W by activation magnitudes before performing the SVD, so the truncation error is pushed toward less-activated directions. This is the same activation-awareness principle applied to a different compression primitive (rank reduction instead of quantization).
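A sketch of the shared principle under simplified assumptions (this illustrates the idea, not the exact ASVD algorithm): weight the columns of W by average activation magnitude, truncate the SVD, then undo the weighting, so the retained rank-k subspace favors heavily activated input directions.

```python
import torch

def activation_aware_low_rank(W: torch.Tensor, X: torch.Tensor, k: int):
    """W: [out, in], X: [num_tokens, in]; returns rank-k factors (A, B) with W ~= A @ B."""
    s = X.abs().mean(dim=0).clamp(min=1e-6)                    # per-input-channel magnitude
    U, S, Vh = torch.linalg.svd(W * s, full_matrices=False)    # SVD of the scaled matrix
    A = U[:, :k] * S[:k]                                       # [out, k]
    B = Vh[:k] / s                                             # [k, in], scaling undone
    return A, B

W = torch.randn(1024, 1024)
X = torch.randn(4096, 1024) * (torch.rand(1024) * 3)
A, B = activation_aware_low_rank(W, X, k=256)
# Relative error measured on the layer's actual outputs, not on W in isolation:
print(((A @ B) @ X.T - W @ X.T).norm() / (W @ X.T).norm())
```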

Key difference: Quantization preserves the exact matrix structure but reduces precision; SVD changes the matrix structure (it becomes a product of two smaller matrices) but preserves precision. Quantization generally achieves better compression ratios for the same accuracy on large models.

10.3 Relation to Knowledge Distillation and Pruning

The compression methods for LLMs form a landscape:

  • Quantization (AWQ, GPTQ, BitNet): Reduce precision of weights
  • Pruning (SparseGPT, Wanda): Remove weights entirely (set to zero)
  • Low-rank decomposition (SVD, LoRA): Replace full matrices with low-rank products
  • Knowledge distillation: Train a smaller model to mimic a larger one

AWQ focuses on quantization and is orthogonal to pruning and distillation — in principle, you could combine all three.


11. Conclusion

AWQ (Activation-aware Weight Quantization) is a landmark contribution to on-device LLM deployment. Let's recap its key contributions:

The intellectual insight: Weight importance in LLMs is determined by the activations flowing through them, not by the weights themselves. A 1% minority of "salient" weight channels — those corresponding to large-magnitude activations — are responsible for the majority of quantization-induced accuracy degradation.

The technical solution: Rather than using hardware-unfriendly mixed precision, AWQ applies a mathematically equivalent per-channel scaling that protects salient weights within the uniform INT4 format. The scaling factor is found by a fast grid search over activation statistics gathered from a small calibration set — no backpropagation and no retraining.

The system contribution: TinyChat translates the theoretical 4× memory savings into 3-4× real-world inference speedup on both desktop GPUs and mobile devices, through careful kernel fusion, SIMD-aware weight packing, and platform-specific optimization.

The impact: AWQ is now the de facto standard for 4-bit LLM quantization in production systems. It enables the Llama-2-70B model — which previously required 8 high-end GPUs — to run on a single mobile GPU (NVIDIA Jetson Orin with 64GB), or the 13B model to run interactively on a laptop GPU (RTX 4070 with 8GB).

The paper asks a simple question — "which weights matter most?" — and finds a simple answer: look at the activations. This is the essence of activation-awareness, and it's a principle that will likely continue to guide compression research for years to come.


Reviewed by Zhongzhu Zhou on 2026-04-03. Paper: AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration arXiv: 2306.00978 | MLSys 2024 Best Paper Award