
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — In-Depth Technical Review

Table of Contents

  1. Introduction: Why Model Quantization Matters
  2. Prerequisites: Background Knowledge You Need
  3. The GPTQ Method: Three Key Insights
  4. The Full Algorithm
  5. Experimental Results and Analysis
  6. Practical Speedups and Deployment
  7. Extreme Quantization and Grouping
  8. Limitations and Discussion
  9. Conclusion and Impact

1. Introduction: Why Model Quantization Matters

Consider the practical challenge of deploying a state-of-the-art large language model. GPT-3, with its 175 billion parameters, requires 326 GB of memory when stored in the compact FP16 (16-bit floating point) format. This exceeds the capacity of even the most powerful single GPU available (NVIDIA A100 with 80 GB), meaning you need at least 5 GPUs just for inference — not training, just running the model to generate text.

This is where model quantization enters the picture. Quantization reduces the precision of model parameters from high-bitwidth representations (FP16 at 16 bits per parameter) to low-bitwidth ones (e.g., INT4 at 4 bits per parameter). If we could compress a 175B model from 16 bits to 3-4 bits per weight, the memory footprint drops from 326 GB to roughly 63-84 GB — potentially fitting in a single GPU.
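As a quick sanity check on these numbers, the weight-storage footprint at each bit-width is simple arithmetic:

```python
# Back-of-envelope weight storage for a 175B-parameter model at several
# bit-widths (weights only; KV cache and activations are extra).
params = 175e9
for bits in (16, 8, 4, 3):
    gib = params * bits / 8 / 2**30          # total bytes -> GiB
    print(f"{bits:2d}-bit: {gib:6.1f} GiB")  # 16-bit prints 326.0 GiB
```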

The fundamental tension is between compression ratio and model accuracy. Prior to GPTQ, the state of affairs was:

  • Simple methods (RTN — Round-to-Nearest): Work well at 8 bits but catastrophically fail at 4 bits or below. For example, RTN at 4 bits makes OPT-175B perform worse than the uncompressed 13B model.
  • Accurate methods (OBQ, BRECQ, AdaRound): Produce good results but are computationally infeasible for billion-parameter models. OBQ's O(d_{row} \cdot d_{col}^3) complexity means it would take years on a 175B model.
  • Training-aware quantization (QAT): Requires full retraining, which costs tens of thousands of GPU hours for these models — completely impractical.

GPTQ resolves this dilemma. Published at ICLR 2023, it is the first method capable of quantizing a 175 billion parameter model to 3-4 bits in approximately four GPU hours, with negligible accuracy degradation. The compressed OPT-175B model can run on a single A100 GPU for the first time, achieving 3.25× inference speedup.


2. Prerequisites: Background Knowledge You Need

This section builds the conceptual foundation needed to understand GPTQ, even if you have no prior exposure to model compression.

2.1 Neural Network Weights and Matrix Multiplication

At its core, a neural network is a composition of mathematical operations. The most fundamental operation in Transformers is linear projection (matrix multiplication):

y = Wx

Here, W is a weight matrix (e.g., 12288 × 12288 for OPT-175B's feed-forward layers), x is the input vector, and y is the output. A large language model contains thousands of such weight matrices, and the total count of their entries constitutes the model's "parameters." GPT-3 has 175 billion such numerical values.

2.2 What Is Quantization?

In a trained model, each weight is typically stored as an FP16 (16-bit floating point) or FP32 (32-bit floating point) number. Quantization maps these high-precision values onto a finite discrete set (the quantization grid).

For example, a 4-bit integer (INT4) can represent only 2^4 = 16 distinct values. The quantization process requires:

  1. Defining the grid: Determine a scale factor and zero point that map the floating-point range to integer range
  2. Rounding: Map each weight to its nearest grid point
  3. Dequantization: At inference time, convert integers back to approximate floating-point values

The simplest quantization scheme is RTN (Round-to-Nearest): directly round each weight to the nearest quantized value. This works acceptably for 8-bit quantization but causes severe accuracy degradation at 4 bits or lower.
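A minimal RTN sketch in NumPy makes the three steps concrete. It uses an asymmetric min-max grid (the same grid family GPTQ uses); the helper names are mine, not from the paper:

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Steps 1-2: build a min-max grid (scale, zero point) and round."""
    levels = 2**bits - 1
    scale = (w.max() - w.min()) / levels
    zero = np.round(-w.min() / scale)
    q = np.clip(np.round(w / scale) + zero, 0, levels)
    return q.astype(np.int32), scale, zero

def dequantize(q, scale, zero):
    """Step 3: map integer codes back to approximate floats."""
    return scale * (q - zero)

rng = np.random.default_rng(6)
w = rng.standard_normal(16).astype(np.float32)
q, s, z = rtn_quantize(w, bits=4)
w_hat = dequantize(q, s, z)   # each entry lands within one grid step of w
```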

2.3 Post-Training vs Training-Aware Quantization

Model quantization methods fall into two broad categories:

  • Quantization-Aware Training (QAT): Introduces quantization operations during training, allowing the model to learn to accommodate low precision. Effective but requires the full training pipeline — prohibitively expensive for 175B models (hundreds of thousands of GPU hours).
  • Post-Training Quantization (PTQ): Compresses a pre-trained model using a small amount of data (typically a few thousand samples), without retraining. Much cheaper computationally, but the technical challenge lies in maintaining accuracy under aggressive compression.

GPTQ is a post-training quantization method.

2.4 The Layer-Wise Quantization Framework

Modern PTQ methods adopt a layer-wise strategy: instead of processing the entire model at once, they handle it one layer at a time. For each layer, the objective is to find quantized weights \hat{W} that minimize the output error:

\arg\min_{\hat{W}} \|\hat{W}X - WX\|_2^2

where W is the original weight matrix and X is the layer input obtained from a small calibration dataset (typically 128 random sequences from C4). The essence of this optimization is: find the quantized weights that keep the layer's output as close to the original as possible.

This layer-wise approach is tractable because it decomposes the global problem into many independent sub-problems, one per layer.
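The objective is easy to evaluate directly on toy data. A sketch with hypothetical shapes (W is d_row × d_col, calibration inputs stacked as columns of X):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))                 # original layer weights
X = rng.standard_normal((32, 256))                # calibration inputs as columns
W_hat = W + 0.01 * rng.standard_normal(W.shape)   # stand-in for quantized weights
err = np.linalg.norm(W_hat @ X - W @ X) ** 2      # ||W_hat X - W X||_2^2
```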

2.5 The Hessian Matrix and Second-Order Information

A central concept in GPTQ is the Hessian matrix. In optimization, first-order information (gradients) tells you the rate of change of a function, while second-order information (the Hessian) tells you how the rate of change itself varies — capturing curvature.

For the layer-wise quantization problem, the Hessian is:

H_F = 2 X_F X_F^T

where F is the set of not-yet-quantized weights and X_F is the corresponding input. The diagonal elements of the Hessian reflect how sensitive each weight is to the output: sensitive weights need careful quantization, while insensitive ones can tolerate coarser treatment.

The inverse Hessian H_F^{-1} provides the optimal compensation directions — it tells us exactly how to adjust the remaining weights to best absorb the error from quantizing a single weight.
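A small sketch of how H and its inverse are obtained from calibration inputs; shapes are toy values, and the dampening term anticipates the trick discussed in Section 3.3:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 256))             # (d_col, n_samples)
H = 2.0 * X @ X.T                              # H_F = 2 X_F X_F^T
H += 0.01 * np.mean(np.diag(H)) * np.eye(32)   # mild dampening keeps H invertible
H_inv = np.linalg.inv(H)
# Diagonal of H^-1: how cheaply each weight's quantization error can be
# absorbed by the remaining weights.
sensitivity = np.diag(H_inv)
```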

2.6 Optimal Brain Quantization (OBQ) — GPTQ's Foundation

GPTQ builds directly on the Optimal Brain Quantization (OBQ) method (Frantar et al., 2022), which itself extends the classic Optimal Brain Surgeon (OBS) framework from pruning to quantization. OBQ's core process:

  1. Select the weight whose quantization incurs the least additional error:

w_q = \arg\min_{w_q} \frac{(\text{quant}(w_q) - w_q)^2}{[H_F^{-1}]_{qq}}

The numerator is the squared quantization error; the denominator is the corresponding diagonal of the inverse Hessian (reflecting the weight's "compensability").

  2. Compensate by updating all remaining weights:

\delta_F = -\frac{w_q - \text{quant}(w_q)}{[H_F^{-1}]_{qq}} \cdot (H_F^{-1})_{:,q}

  3. Update the inverse Hessian by removing the quantized weight's row and column:

H_{-q}^{-1} = H^{-1} - \frac{1}{[H^{-1}]_{qq}} H_{:,q}^{-1} H_{q,:}^{-1}

OBQ's fatal flaw: computational complexity. For a d_{row} \times d_{col} weight matrix, the cost is O(d_{row} \cdot d_{col}^3). For OPT-175B, where layers can be 12288 × 12288, this would require years of computation. GPTQ reduces this by over three orders of magnitude.
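The three steps above can be sketched for a single weight row, with plain integer rounding standing in for quant(·). This illustrates the per-step math only, not the OBQ codebase:

```python
import numpy as np

def obq_step(w, H_inv):
    """One OBQ step on weight row w, with np.round as the quantizer."""
    d = np.diag(H_inv)
    q = int(np.argmin((np.round(w) - w) ** 2 / d))     # step 1: greedy selection
    qv = np.round(w[q])
    w = w - (w[q] - qv) / H_inv[q, q] * H_inv[:, q]    # step 2: compensation
    w[q] = qv                                          # freeze the quantized weight
    # Step 3 (not shown): drop row/column q of H_inv via the removal formula.
    return w, q

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 64))
H_inv = np.linalg.inv(2.0 * X @ X.T + 0.01 * np.eye(8))
w0 = 3.0 * rng.standard_normal(8)
w1, q = obq_step(w0, H_inv)   # w1[q] sits exactly on the integer grid
```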


3. The GPTQ Method: Three Key Insights

GPTQ achieves its dramatic speedup through three algorithmic innovations, each addressing a different bottleneck.

3.1 Insight 1: Arbitrary Order Quantization

OBQ uses a greedy strategy — always quantizing the weight with the smallest current error. GPTQ's first breakthrough: on large, heavily-parameterized layers, the quantization order barely matters.

Why? The greedy strategy reduces the number of weights with large individual errors, but these "difficult" weights get pushed to the end of the process, when very few unquantized weights remain for compensation. The two effects roughly cancel out.

The key consequence: If order doesn't matter, we can quantize all rows of W in the same column order. Since the Hessian H_F depends only on the layer input X_F (identical for all rows), the inverse Hessian updates are shared across all rows.

Complexity improvement:

  • OBQ: d_{row} \cdot d_{col} Hessian updates → O(d_{row} \cdot d_{col}^3)
  • GPTQ: only d_{col} Hessian updates → O(\max\{d_{row} \cdot d_{col}^2, d_{col}^3\})

For large models where both dimensions are ~12,000, this is a 10,000× speedup.
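The claimed factor follows directly from the two cost expressions; for a square OPT-175B-sized layer:

```python
# Ratio of OBQ's cost to GPTQ's for a square 12288 x 12288 layer.
d_row = d_col = 12288
obq_cost = d_row * d_col**3                     # O(d_row * d_col^3)
gptq_cost = max(d_row * d_col**2, d_col**3)     # O(max{d_row*d_col^2, d_col^3})
print(obq_cost // gptq_cost)                    # -> 12288, i.e. roughly 10,000x
```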

3.2 Insight 2: Lazy Batch-Updates

Even with shared Hessian updates, a direct implementation is slow due to its low compute-to-memory-access ratio. Updating the Hessian inverse (the removal formula from Section 2.6) touches every element of a potentially huge matrix while performing only a few FLOPs per element. Modern GPUs have massive compute throughput but relatively limited memory bandwidth — such operations are memory-bound and severely underutilize the GPU.

GPTQ's solution — lazy batching:

  1. Divide columns into blocks of size B = 128
  2. Within a block, only update that block's weights and the corresponding B \times B Hessian sub-block
  3. After completing a block, perform a single global update to the entire weight matrix and Hessian

Why this works: The quantization decision for column i depends only on updates applied to column i itself — updates to later columns are irrelevant at that point. So deferring those updates is mathematically exact.

The multi-weight update formulas:

\delta_F = -(w_Q - \text{quant}(w_Q)) ([H_F^{-1}]_{QQ})^{-1} (H_F^{-1})_{:,Q}

H_{-Q}^{-1} = H^{-1} - H_{:,Q}^{-1} ([H^{-1}]_{QQ})^{-1} H_{Q,:}^{-1}

where Q is the set of all column indices in the completed block.

This doesn't reduce theoretical FLOPs but converts many small memory operations into fewer large ones, providing an order of magnitude practical speedup on large models.
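The batched δ_F formula can be exercised on a toy weight row; note how one matrix product lands every column of the block Q exactly on its grid point while shifting the remaining weights to compensate (integer rounding stands in for the quantizer):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
X = rng.standard_normal((d, 128))
H_inv = np.linalg.inv(2.0 * X @ X.T + 0.01 * np.eye(d))

w = 3.0 * rng.standard_normal(d)
Q = np.arange(4)                             # the completed block of columns
err = w[Q] - np.round(w[Q])                  # per-column quantization errors
block_inv = np.linalg.inv(H_inv[np.ix_(Q, Q)])
delta = -(err @ block_inv @ H_inv[Q, :])     # one batched multi-weight update
w_updated = w + delta
# Entries in Q now sit exactly on the grid; later entries absorb the error.
```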

3.3 Insight 3: Cholesky Reformulation

At the scale of 175B parameters, repeated application of the Hessian inverse update formula (especially combined with batch updates) causes catastrophic numerical instabilities. Specifically, H_F^{-1} can become indefinite, causing the algorithm to update remaining weights in wildly incorrect directions — producing arbitrarily bad quantizations.

The authors observed this problem's probability increases with model size: for models larger than a few billion parameters, it almost certainly occurs in at least some layers.

GPTQ's solution rests on a key observation: when quantizing weight q, the algorithm only needs row q of H_{F_q}^{-1} (specifically, elements from the diagonal onward). The row-removal update from Section 2.6 is essentially equivalent to Cholesky decomposition, differing only by a normalization factor.

Therefore, GPTQ can:

  1. Precompute all needed information via a single, numerically-stable Cholesky decomposition at the start
  2. Apply mild dampening: add \lambda = 0.01 \times \text{avg}(\text{diag}(H)) to the diagonal before decomposition

This simultaneously:

  • Eliminates numerical instability
  • Provides additional speedup via optimized Cholesky kernels
  • Makes the algorithm robust enough for any model size

4. The Full Algorithm

Integrating all three insights, GPTQ's complete procedure is:

Input: Weight matrix W (d_{row} \times d_{col}), inverse Hessian H^{-1} = (2XX^T + \lambda I)^{-1}, block size B

Algorithm:

Q ← 0 (d_row × d_col)                        // quantized output
E ← 0 (d_row × B)                            // block quantization errors
H⁻¹ ← Cholesky(H⁻¹)ᵀ                         // precompute via Cholesky

for i = 0, B, 2B, ... do                     // iterate over column blocks
    for j = i, ..., i+B-1 do                 // iterate within block
        Q[:,j] ← quant(W[:,j])               // quantize current column
        E[:,j-i] ← (W[:,j] - Q[:,j]) / H⁻¹[j,j]       // scaled error
        W[:,j:(i+B)] -= E[:,j-i] · H⁻¹[j,j:(i+B)]     // update block
    end for
    W[:,(i+B):] -= E · H⁻¹[i:(i+B),(i+B):]   // global update
end for

Intuitive understanding: Imagine the weight matrix as a large spreadsheet. We process it left-to-right in blocks of 128 columns. Within each block, we quantize columns one by one, immediately compensating within the block. After finishing a block, we do a single bulk update to all remaining columns. The Cholesky decomposition, computed once at the start, provides all the second-order information needed throughout.
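For concreteness, the same loop can be written as runnable NumPy on toy shapes. Integer rounding stands in for the paper's per-row min-max grid, and the function name is mine; this sketches the control flow under those assumptions, not the authors' CUDA-backed implementation:

```python
import numpy as np

def gptq_quantize(W, X, block=128, damp=0.01, quant=np.round):
    """NumPy sketch of the GPTQ loop. W: (d_row, d_col) weights;
    X: (d_col, n_samples) calibration inputs; quant: placeholder grid."""
    W = W.astype(np.float64).copy()
    d_col = W.shape[1]
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d_col)    # mild dampening
    # Upper-triangular Cholesky factor of H^-1 holds every row needed later.
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T
    Q = np.zeros_like(W)
    for i in range(0, d_col, block):
        j_end = min(i + block, d_col)
        E = np.zeros((W.shape[0], j_end - i))          # block errors
        for j in range(i, j_end):
            Q[:, j] = quant(W[:, j])                   # quantize column j
            E[:, j - i] = (W[:, j] - Q[:, j]) / Hinv[j, j]
            # compensate within the current block only (lazy)
            W[:, j:j_end] -= np.outer(E[:, j - i], Hinv[j, j:j_end])
        # one global update for all columns past the block
        W[:, j_end:] -= E @ Hinv[i:j_end, j_end:]
    return Q

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 32))
X = rng.standard_normal((32, 64))
Q = gptq_quantize(W, X, block=8)
```

Swapping `quant` for the min-max grid of Section 2.2 recovers the paper's setting; the loop structure is unchanged.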

Practical implementation details:

  • GPTQ loads one Transformer block (typically 6 layers) at a time into GPU memory
  • It accumulates the layer Hessians and performs quantization on each layer
  • After quantizing a block, inputs are re-computed through the quantized block to produce inputs for the next block (this "sequential quantization" noticeably improves results)
  • Calibration uses only 128 random 2048-token segments from the C4 dataset
  • Standard uniform per-row asymmetric quantization on the min-max grid

5. Experimental Results and Analysis

5.1 Validation on Small Models

First, GPTQ is validated against state-of-the-art PTQ methods on standard benchmarks (ResNet18 and ResNet50):

Method     RN18 4-bit   RN18 3-bit   RN50 4-bit   RN50 3-bit
AdaRound   69.34        68.37        75.84        75.14
AdaQuant   68.12        59.21        74.68        64.98
BRECQ      69.37        68.47        75.88        75.32
OBQ        69.56        68.69        75.72        75.24
GPTQ       69.37        67.88        75.71        74.87

(Full-precision top-1 accuracy: RN18 = 69.76%, RN50 = 76.13%.)

GPTQ matches the best methods at 4-bit and is slightly behind at 3-bit, while being 60× faster (< 1 minute vs ~1 hour). On language models (BERT-base, OPT-125M), GPTQ slightly outperforms OBQ at 3-bit, possibly because OBQ's heuristics (like early outlier rounding) may need tuning for non-vision models.

5.2 Quantization Runtime

Model OPT-13B OPT-30B OPT-66B OPT-175B
Time 20.9 min 44.9 min 1.6 hr 4.2 hr
Model BLOOM-1.7B BLOOM-3B BLOOM-7.1B BLOOM-176B
Time 2.9 min 5.2 min 10.0 min 3.8 hr

All on a single NVIDIA A100 GPU. For reference, ZeroQuant-LKD takes 3 hours for a 1.3B model, which would extrapolate to hundreds of hours for 175B. GPTQ achieves a 100× improvement in quantization throughput per parameter.

5.3 Language Generation — OPT Model Family

WikiText2 perplexity (lower is better):

Model Bits 125M 350M 1.3B 2.7B 6.7B 13B 30B 66B 175B
Full 16 27.65 22.00 14.63 12.47 10.86 10.13 9.56 9.34 8.34
RTN 4 37.28 25.94 48.17 16.92 12.10 11.32 10.98 110 10.54
GPTQ 4 31.12 24.24 15.47 12.87 11.39 10.31 9.63 9.55 8.37
RTN 3 1.3e3 64.57 1.3e4 1.6e4 5.8e3 3.4e3 1.6e3 6.1e3 7.3e3
GPTQ 3 53.85 33.79 20.97 16.88 14.86 11.61 10.27 14.16 8.68

Key findings:

  1. 4-bit quantization: GPTQ on OPT-175B loses only 0.03 perplexity (8.34 → 8.37), while RTN loses 2.2 points (8.34 → 10.54). The RTN-quantized 175B model performs worse than the uncompressed 13B model!

  2. 3-bit quantization: RTN completely collapses (perplexity explodes to thousands), while GPTQ maintains reasonable accuracy. OPT-175B at 3-bit GPTQ achieves 8.68 perplexity — only 0.34 above the full-precision model, at over 5× compression.

  3. Scaling trend: Larger models are generally easier to quantize (with the exception of OPT-66B, which has dead units in early layers). This is excellent news for practical applications, since larger models benefit most from compression.

5.4 Language Generation — BLOOM Model Family

Model Bits 560M 1.1B 1.7B 3B 7.1B 176B
Full 16 22.42 17.69 15.39 13.48 11.37 8.11
RTN 4 25.90 22.00 16.97 14.76 12.10 8.37
GPTQ 4 24.03 19.05 16.48 14.20 11.73 8.21
RTN 3 57.08 50.19 63.59 39.36 17.38 571
GPTQ 3 32.31 25.08 21.11 17.40 13.47 8.64

BLOOM shows similar patterns with generally smaller gaps between methods, suggesting this model family may be inherently easier to quantize.

5.5 Deep Evaluation on 175B Models

Comprehensive results across multiple benchmarks for the two largest models:

Method     Bits      OPT-175B: Wiki2 / PTB / C4     BLOOM-176B: Wiki2 / PTB / C4 / LAMB.↑
Baseline   16        8.34 / 12.01 / 10.13           8.11 / 14.59 / 11.71 / 67.40
GPTQ       4         8.37 / 12.26 / 10.28           8.21 / 14.75 / 11.81 / 67.71
GPTQ       3         8.68 / 12.68 / 10.67           8.64 / 15.57 / 12.27 / 65.10
GPTQ       3/g1024   8.45 / n/a / 10.47             8.35 / 15.01 / 11.98 / 67.47
GPTQ       3/g128    8.45 / n/a / 10.36             8.26 / 14.89 / 11.85 / 67.86

With group-size 1024 (adding only ~0.02 extra bits), perplexity improves by about 0.2 on average. Group-size 128 (~0.15 extra bits) brings results within 0.1-0.3 of uncompressed accuracy.

5.6 Zero-Shot Tasks

On LAMBADA, ARC (Easy and Challenge), and PIQA:

  • 4-bit: Both RTN and GPTQ perform relatively well
  • 3-bit: RTN collapses while GPTQ maintains strong accuracy
  • The pattern is consistent with perplexity results, confirming that GPTQ's compression is not just numerically faithful but preserves actual task performance

6. Practical Speedups and Deployment

6.1 Memory Savings

3-bit GPTQ OPT-175B:

  • Model weights: ~63 GB (embeddings and output layer kept in FP16)
  • KV cache (2048-token context): ~9 GB
  • Total: ~72 GB → fits in a single 80 GB A100 GPU!

Comparison:

  • FP16: requires 5× A100 80 GB GPUs
  • LLM.int8() (8-bit): requires 3× A100 80 GB GPUs
  • GPTQ 3-bit: requires only 1 A100

6.2 Inference Latency

GPU FP16 Latency 3bit GPTQ Speedup GPU Reduction
A6000 48 GB 589 ms/token 130 ms/token 4.53× 8 → 2
A100 80 GB 230 ms/token 71 ms/token 3.25× 5 → 1

The authors developed custom GPU kernels that perform quantized-matrix full-precision-vector products — dynamically dequantizing weights during the matrix-vector multiply. Although dequantization adds compute, the 5× reduction in memory loads (3 bits vs 16 bits) more than compensates, since generative inference is memory-bandwidth-bound (batch size 1, autoregressive token-by-token generation).

Crucially, this requires no activation quantization — only weights are compressed.
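The arithmetic of such a kernel can be sketched in NumPy. Real kernels additionally pack the 3-bit codes into machine words to realize the bandwidth savings; this sketch keeps one code per array element and shows only the quantize and dequantize-and-multiply math (helper names are mine):

```python
import numpy as np

def quantize_rows(W, bits=3):
    """Per-row asymmetric min-max grid: integer codes plus (scale, zero)."""
    levels = 2**bits - 1
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    zero = np.round(-w_min / scale)
    codes = np.clip(np.round(W / scale) + zero, 0, levels).astype(np.uint8)
    return codes, scale, zero

def matvec_dequant(codes, scale, zero, x):
    """Dequantize on the fly inside the product, as the GPU kernel does."""
    return ((codes.astype(np.float32) - zero) * scale) @ x

rng = np.random.default_rng(4)
W = rng.standard_normal((8, 32)).astype(np.float32)
x = rng.standard_normal(32).astype(np.float32)
codes, s, z = quantize_rows(W)
y = matvec_dequant(codes, s, z, x)   # approximates W @ x from 3-bit codes
```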

6.3 Deployment Significance

Before GPTQ, running GPT-3-class models required expensive multi-GPU clusters. GPTQ brought the barrier down to:

  • A single A100 for 175B models
  • Two A6000s (significantly cheaper than A100s) as an alternative
  • Faster inference than the multi-GPU FP16 baseline (eliminating inter-GPU communication overhead)

7. Extreme Quantization and Grouping

7.1 Grouping Strategy

GPTQ is fully compatible with grouped quantization: instead of using one set of quantization parameters (scale + zero point) per entire row, group consecutive g weights and give each group independent parameters.

Group Size Avg Bits OPT-175B Wiki2 PPL BLOOM-176B Wiki2 PPL
No grouping 3.00 8.68 8.64
g=1024 ~3.02 8.45 8.35
g=128 ~3.15 8.45 8.26
Full precision 16.00 8.34 8.11

The storage overhead of grouping is minimal (0.02-0.15 extra bits per weight), but the accuracy improvement is substantial. GPTQ naturally integrates grouping into its quantization process, using the most current updated weights when determining group parameters.
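A sketch of per-group parameters on one weight row, using a min-max grid per group of g weights (the helper name is mine):

```python
import numpy as np

def quantize_grouped(w, g=128, bits=3):
    """Quantize-dequantize one row with an independent min-max grid per
    group of g consecutive weights."""
    levels = 2**bits - 1
    out = np.empty_like(w)
    for start in range(0, w.size, g):
        grp = w[start:start + g]
        scale = (grp.max() - grp.min()) / levels
        zero = np.round(-grp.min() / scale)
        q = np.clip(np.round(grp / scale) + zero, 0, levels)
        out[start:start + g] = scale * (q - zero)   # dequantized values
    return out

rng = np.random.default_rng(5)
w = rng.standard_normal(1024)
w_g = quantize_grouped(w, g=128)   # finer grids than one row-wide grid
```

Each group stores one extra (scale, zero) pair, which is the small per-weight overhead quoted above.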

7.2 2-Bit Quantization

GPTQ can push into the extreme 2-bit regime:

Model FP16 PPL g=128 (~2.2 bit) g=64 (~2.4 bit) g=32 (~2.6 bit) 3-bit vanilla
OPT-175B 8.34 9.58 9.18 8.94 8.68
BLOOM-176B 8.11 9.55 9.17 8.83 8.64

At ~2.2 bits, perplexity increases by less than 1.5 points. At ~2.6 bits, the increase is only 0.6-0.7.

Most remarkably, with group-size 8, GPTQ achieves ternary quantization (-1, 0, +1) on OPT-175B with only 9.20 WikiText2 perplexity — less than 1 point degradation from full precision. This pattern could be extremely efficient on custom hardware like FPGAs.


8. Limitations and Discussion

8.1 Technical Limitations

  1. Memory-movement savings only, no compute reduction: GPTQ's speedup comes from reduced data loading, not reduced arithmetic. For compute-bound scenarios (large-batch inference), the speedup is limited. For large batches, the approach reduces to decompressing the matrix before computation (taking < 1.5 ms on A100) with minimal overhead.

  2. Weight-only quantization: GPTQ does not quantize activations. While the authors argue activations are not the bottleneck for generative inference (which is memory-bandwidth-bound), activation quantization could provide additional benefits for other workloads. This can be addressed using orthogonal techniques like those in ZeroQuant.

  3. Hardware support gap: At the time of publication, mainstream GPUs lacked native support for mixed-precision operations (FP16 × INT4). GPTQ requires dynamic dequantization, adding overhead. This limitation is being addressed by newer hardware generations.

8.2 The OPT-66B Anomaly

OPT-66B is the sole exception to the "larger models are easier to quantize" trend. Investigation reveals this model has a significant fraction of dead units in early layers — a training artifact that makes compression more difficult. This highlights that model quality affects quantizability.

8.3 Calibration Data

GPTQ uses only 128 random 2048-token segments from C4 — generic web text with no task-specific data. This means:

  • Results are genuinely "zero-shot" with respect to evaluation tasks
  • The method is task-agnostic and broadly applicable
  • But task-specific calibration data might yield even better results for specific applications

8.4 Bias and Safety Considerations

The authors note that while GPTQ preserves standard accuracy metrics, the impact of compression on secondary measures — particularly bias effects — has not been thoroughly studied. Making large models more accessible is a double-edged sword: it democratizes AI capabilities but also lowers the barrier for misuse.


9. Conclusion and Impact

9.1 Core Contributions

GPTQ represents a watershed moment in large language model compression:

  1. First demonstration: 175B models quantized to 3-4 bits in ~4 GPU hours with negligible accuracy loss
  2. Three algorithmic innovations: Arbitrary-order quantization (complexity reduction), lazy batch-updates (GPU utilization), Cholesky reformulation (numerical stability)
  3. Practical deployment: GPT-3-class models running on a single GPU, with 3.25-4.5× inference speedup
  4. Accuracy preservation: Only 0.03 perplexity loss at 4-bit on OPT-175B

9.2 Downstream Impact

GPTQ's influence extends far beyond the paper itself:

  • AutoGPTQ: Became one of the most widely-used quantization tools in the open-source ecosystem
  • Spawned AWQ, SqueezeLLM, QuIP, and many successors: Catalyzed an entire ecosystem of quantization research
  • HuggingFace integration: Became a standard quantization option for open-source LLM deployment
  • Changed the deployment paradigm: From "requires a cluster" to "runs on a single GPU"
  • Enabled consumer-grade LLM inference: Models like LLaMA and Mistral are routinely quantized with GPTQ variants for local deployment

9.3 Key Takeaways

  1. Clever algorithms beat brute force: A 10,000× speedup over OBQ through three simple-but-profound insights, not through more hardware
  2. Second-order information is powerful: The Hessian-based compensation is what makes 3-4 bit quantization possible where naive rounding fails catastrophically
  3. Systems optimization matters equally: Lazy batching and Cholesky reformulation are not algorithmic innovations per se, but engineering decisions critical to practical viability
  4. The theory-practice bridge: Custom GPU kernels and inference harnesses are what transform a mathematical result into deployable technology

9.4 Reproducibility

The authors provide complete code at https://github.com/IST-DASLab/gptq including: model compression scripts for all OPT and BLOOM variants, perplexity evaluation, 3-bit CUDA kernels, zero-shot evaluation, and benchmarking tools.


References

  1. Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR 2023.
  2. Frantar, E., Singh, S. P., & Alistarh, D. (2022). Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning. NeurIPS 2022.
  3. Hassibi, B., Stork, D. G., & Wolff, G. J. (1993). Optimal Brain Surgeon and General Network Pruning. IEEE International Conference on Neural Networks.
  4. Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
  5. Yao, Z., et al. (2022). ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers.
  6. Nagel, M., et al. (2020). Up or Down? Adaptive Rounding for Post-Training Quantization. ICML 2020.
  7. Li, Y., et al. (2021). BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction. ICLR 2021.
  8. Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS 2020.
  9. Zhang, S., et al. (2022). OPT: Open Pre-trained Transformer Language Models.
  10. Park, G., et al. (2022). nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models.