Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

AI-generated keywords: Quantization

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

The authors conduct a systematic study of how quantization affects reasoning language models
Quantization is widely used to reduce inference cost, but its effect on reasoning models has been understudied
The evaluation covers the DeepSeek-R1-Distilled Qwen and LLaMA families (1.5B to 70B parameters), QwQ-32B, and Qwen3-8B
Lossless quantization is achievable with W8A8 or W4A16, but lower bit-widths introduce significant accuracy risks
Model size, model origin, and task difficulty are critical determinants of performance
Contrary to expectations, quantized models do not produce longer outputs
Strategically scaling model size or reasoning steps can enhance performance

Authors: Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou

arXiv: 2504.04823v2 - DOI (cs.CL)

COLM 2025

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this paper, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, QwQ-32B, and Qwen3-8B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes are open-sourced in https://github.com/ruikangliu/Quantized-Reasoning-Models.

Submitted to arXiv on 07 Apr. 2025

Results of the summarizing process for the arXiv paper: 2504.04823v2

Comprehensive Summary

In their paper titled "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models," authors Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, and Lu Hou examine the impact of quantization on reasoning language models. These models perform well on complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its effect on reasoning models has remained understudied. The authors describe their work as the first systematic study of quantized reasoning models. They evaluate the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families, ranging from 1.5B to 70B parameters, as well as QwQ-32B and Qwen3-8B. The evaluation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. The findings show that lossless quantization can be achieved with W8A8 or W4A16, while lower bit-widths introduce significant accuracy risks. The authors also identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling model sizes or reasoning steps can effectively enhance performance. All quantized models and code are open-sourced on GitHub at https://github.com/ruikangliu/Quantized-Reasoning-Models.

Layman's Summary

The authors studied how storing a reasoning language model's numbers at lower precision affects its performance. They looked at several models of different sizes and origins. Some settings shrink a model with no measurable loss in accuracy, but using even fewer bits can reduce accuracy noticeably. The size of the model, where it comes from, and how hard the task is all play a big role in how well it works. Quantized models do not produce longer answers, contrary to what was expected. Making models bigger or letting them reason for more steps can make them work better.

Definitions

- Quantization: Storing a model's numbers (weights, activations, or KV cache) with fewer bits to make it smaller and cheaper to run.
- Inference cost: The compute, memory, and time needed to run a trained model and produce an answer.
- Lossless quantization: Quantization that causes no measurable drop in accuracy on the evaluated tasks.
- Accuracy risks: The chance that quantizing a model will make its answers less correct.
- Parameter range: The span of model sizes studied, measured in number of parameters (here 1.5B to 70B).

Introduction

Reasoning language models have attracted growing attention because of their strong performance on complex tasks. These models work through a problem with an extended chain-of-thought before answering, which helps on difficult problems such as mathematical, scientific, and programming questions. However, this extended reasoning process also increases inference overhead, which makes deployment more expensive.

To address inference cost in general, quantization has been widely adopted for large language models. Quantization reduces the numerical precision used to store and compute values such as weights, activations, and the KV cache, mapping them onto a small set of discrete levels. This yields smaller models and can speed up computation, ideally with little loss of accuracy. While quantization is well established for large language models in general, its impact on reasoning models has remained understudied. In their paper titled "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models," Ruikang Liu et al. examine that impact. They conduct a systematic study using several open-sourced models and evaluate different quantization methods at varying bit-widths.
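To make the idea of reducing numerical precision concrete, here is a minimal sketch of symmetric round-to-nearest quantization in plain Python. It is an illustration only: the paper's metadata says the study uses state-of-the-art algorithms but does not name them, so this simple scheme, the function names, and the example weights are all assumptions made for this sketch.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integer codes with a single shared scale.

    Illustrative round-to-nearest scheme, not the method used in the paper.
    """
    if bits < 2:
        raise ValueError("bits must be at least 2")
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest absolute difference between original and restored values."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weights, chosen only for demonstration.
weights = [0.82, -0.41, 0.07, -1.30, 0.55, 0.002]
for bits in (8, 4, 3):
    print(f"{bits}-bit max abs error: {max_abs_error(weights, bits):.4f}")
```

With fewer bits the grid of representable values gets coarser, so the worst-case rounding error grows. That is the basic intuition behind the accuracy risk at low bit-widths, although real methods use finer-grained scales and calibration to limit it.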

Methodology

The study by Liu et al. evaluates the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families, which range from 1.5B to 70B parameters, together with QwQ-32B and Qwen3-8B. The evaluation covers weight, KV cache (key-value cache), and activation quantization using state-of-the-art algorithms at varying bit-widths. To measure the impact of quantization, the authors evaluate the models on mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Based on the abstract, the reported findings concern accuracy on these benchmarks and the length of the generated outputs.

Results

The findings of the study reveal that lossless quantization can be achieved with W8A8 (8-bit weights, 8-bit activations) or W4A16 (4-bit weights, 16-bit activations), while lower bit-widths introduce significant accuracy risks. The authors also identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. The abstract does not state a mechanism for this, so any explanation (for example, that quantization degrades the quality of intermediate representations rather than lengthening the reasoning) remains a hypothesis. Finally, strategically scaling model sizes or reasoning steps can effectively enhance performance.
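As a rough illustration of what the weight bit-width in this notation means for storage, the following Python sketch estimates weight memory at 16, 8, and 4 bits for the smallest and largest model sizes mentioned above. The function name is hypothetical, and the estimate deliberately ignores quantization metadata (scales and zero-points), activations, and the KV cache, so it is a lower bound rather than a measured figure from the paper.

```python
def weight_memory_gib(num_params, weight_bits):
    """Approximate weight storage in GiB, counting only the weights.

    Ignores per-group scales, zero-points, activations, and KV cache.
    """
    total_bytes = num_params * weight_bits / 8
    return total_bytes / 2**30


# Model sizes taken from the study's stated range (1.5B to 70B parameters).
for num_params in (1.5e9, 70e9):
    for weight_bits in (16, 8, 4):
        gib = weight_memory_gib(num_params, weight_bits)
        print(f"{num_params / 1e9:>5.1f}B params at {weight_bits:>2}-bit weights: {gib:8.2f} GiB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight storage by a factor of four, which is why a setting like W4A16 is attractive when it preserves accuracy.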

Conclusion

In conclusion, the paper by Liu et al. contributes to understanding the impact of quantization on reasoning language models. Their systematic study shows that moderate settings such as W8A8 or W4A16 can be lossless, while lower bit-widths reduce inference cost further only at the risk of significant accuracy loss. The findings also point to mitigation strategies, namely scaling model sizes or reasoning steps. All quantized models and code used in the study are open-sourced on GitHub at https://github.com/ruikangliu/Quantized-Reasoning-Models, which allows other researchers to replicate and build on the work. Overall, "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models" offers insight into a previously understudied question and highlights both the benefits and the drawbacks of quantizing reasoning models.

Created on 22 Aug. 2025
