Scaling Law for Quantization-Aware Training

AI-generated keywords: Large language models

AI-generated Key Points

  • Large language models (LLMs) demand substantial computational and memory resources, which makes deployment challenging
  • Quantization-aware training (QAT) reduces model precision while maintaining performance
  • The scaling behavior of QAT, particularly at 4-bit precision (W4A4), is not well understood
  • The paper introduces a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size

Authors: Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, Ping Luo

A unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size
License: CC BY 4.0

Abstract: Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.
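To make "quantization group size" concrete, here is a minimal sketch of group-wise 4-bit fake quantization. It is an illustration under stated assumptions, not the authors' implementation: the symmetric absmax scheme, the helper names `fake_quantize_groupwise` and `mse`, the Gaussian toy weights, and the group sizes 32, 128, and 512 are all hypothetical choices made for this example. It only shows the direction of the granularity effect described in the abstract, namely that coarser groups (one scale shared by more values) tend to give a larger quantization error.

```python
import random


def fake_quantize_groupwise(values, group_size, bits=4):
    """Quantize and dequantize a flat list, using one absmax scale per group.

    Assumption: symmetric round-to-nearest quantization with a signed
    integer range of [-qmax, qmax]. The paper may use a different scheme.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    dequantized = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        absmax = max(abs(v) for v in group)
        scale = absmax / qmax if absmax > 0 else 1.0
        for v in group:
            q = max(-qmax, min(qmax, round(v / scale)))
            dequantized.append(q * scale)
    return dequantized


def mse(reference, approximation):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(reference, approximation)) / len(reference)


# Toy "weights": 4096 samples from a standard normal (hypothetical data).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(4096)]

# Larger group size means coarser granularity.
errors = {g: mse(weights, fake_quantize_groupwise(weights, g)) for g in (32, 128, 512)}
for group_size, error in errors.items():
    print(f"group size {group_size:>3}: MSE = {error:.5f}")
```

A larger group is more likely to contain a large-magnitude value, which stretches the shared scale and widens the rounding step for every other value in that group. This toy measures only the rounding error of a fixed tensor; the paper's quantization error is measured on trained models, so the numbers here are not comparable to its results.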

Submitted to arXiv on 20 May 2025


Results of the summarizing process for the arXiv paper: 2505.14302v1

Large language models (LLMs) demand substantial computational and memory resources, which makes deployment difficult. Quantization-aware training (QAT) has emerged as a solution to these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, particularly at 4-bit precision (W4A4), is not well understood. Existing scaling laws for QAT often overlook crucial factors such as the number of training tokens and quantization granularity, limiting their practicality. To address this gap, this paper introduces a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, the study shows that quantization error decreases with increasing model size but increases with more training tokens and coarser quantization granularity. The analysis further decomposes W4A4 quantization error into weight and activation components, both of which follow the overall trend but with different sensitivities. Specifically, weight quantization error grows more rapidly as the number of training tokens increases. The research identifies activation quantization error in the FC2 layer, caused by outliers, as the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, the study demonstrates that weight and activation quantization errors can converge to similar levels. Moreover, with additional training data, weight quantization error eventually surpasses activation quantization error, which makes reducing weight quantization error important in such scenarios as well.
Created on 22 Aug. 2025
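
The outlier bottleneck mentioned in the summary can be illustrated with a small toy experiment. This is a sketch under stated assumptions rather than the paper's setup: it uses per-tensor symmetric absmax quantization, synthetic Gaussian "activations", a hypothetical outlier magnitude of 50.0, and 8 bits as a stand-in for the higher precision in a mixed-precision scheme (the source does not state which precision the authors use). The helper names `quantize_per_tensor` and `mse` are made up for this example.

```python
import random


def quantize_per_tensor(values, bits):
    """Symmetric round-to-nearest quantization with a single absmax scale.

    Assumption: one scale for the whole tensor, signed range [-qmax, qmax].
    """
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(reference, approximation):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(reference, approximation)) / len(reference)


rng = random.Random(0)

# Synthetic activations without outliers (hypothetical data).
clean = [rng.gauss(0.0, 1.0) for _ in range(1024)]

# Same activations with a few large-magnitude outliers injected.
with_outliers = list(clean)
for index in rng.sample(range(len(with_outliers)), 4):
    with_outliers[index] = 50.0  # hypothetical outlier magnitude

err_clean_4bit = mse(clean, quantize_per_tensor(clean, bits=4))
err_outlier_4bit = mse(with_outliers, quantize_per_tensor(with_outliers, bits=4))
err_outlier_8bit = mse(with_outliers, quantize_per_tensor(with_outliers, bits=8))

print(f"4-bit, no outliers:   MSE = {err_clean_4bit:.5f}")
print(f"4-bit, with outliers: MSE = {err_outlier_4bit:.5f}")
print(f"8-bit, with outliers: MSE = {err_outlier_8bit:.5f}")
```

With only 15 signed levels, a handful of outliers stretches the scale so far that most ordinary values round to zero. Giving the outlier-heavy tensor more bits restores resolution for those values, which is the intuition behind treating the bottleneck layer at higher precision.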
