Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models

AI-generated keywords: Quantization

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors conducted a study on the impact of quantization on reasoning language models
  • Quantization aims to reduce inference cost but its effects on reasoning models have not been thoroughly explored
  • Evaluation covers the DeepSeek-R1-Distilled Qwen and LLaMA families from 1.5B to 70B parameters, plus QwQ-32B and Qwen3-8B
  • Lossless quantization is achievable with W8A8 or W4A16, but lower bit-widths introduce significant accuracy risks
  • Model size, origin, and task difficulty are crucial factors influencing performance
  • Contrary to expectations, quantized models do not show increased output lengths
  • Scaling model sizes or reasoning steps can enhance performance

Authors: Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou

COLM 2025

Abstract: Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this paper, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, QwQ-32B, and Qwen3-8B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes are open-sourced in https://github.com/ruikangliu/Quantized-Reasoning-Models.
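The W8A8 and W4A16 labels in the abstract follow the common convention of naming the weight bit-width (W) and the activation bit-width (A). As a rough illustration of why lower weight bit-widths carry more risk, the sketch below applies plain symmetric round-to-nearest quantization to synthetic weights and compares the reconstruction error at 8 and 4 bits. This is an assumption-laden toy, not the state-of-the-art algorithms the paper evaluates: the function names, the per-tensor scale, and the Gaussian test weights are all hypothetical choices made for the example.

```python
import random


def quantize_rtn(values, bits):
    """Symmetric round-to-nearest quantization with one scale per tensor.

    Returns (integer codes, scale). Illustrative only; not the paper's method.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each value to the nearest integer level and clamp to the range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def mean_abs_error(values, bits):
    """Average absolute difference between values and their quantized form."""
    codes, scale = quantize_rtn(values, bits)
    restored = dequantize(codes, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Synthetic "weights": a hypothetical stand-in for one layer of a model.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"8-bit mean abs error: {err8:.6f}")
print(f"4-bit mean abs error: {err4:.6f}")
```

With only 15 representable levels at 4 bits versus 255 at 8 bits, the 4-bit error comes out noticeably larger. Reconstruction error on weights is only a proxy, though; the paper measures accuracy on reasoning benchmarks.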

Submitted to arXiv on 07 Apr. 2025


Results of the summarizing process for the arXiv paper: 2504.04823v2

This paper's license doesn't allow us to build upon its content, so the summary below was produced from the paper's metadata rather than the full article.

In their paper titled "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models," authors Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, and Lu Hou examine the impact of quantization on reasoning language models. These models perform strongly on complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely used to reduce the inference cost of large language models, its effects on reasoning models remain understudied.

The authors present their work as the first systematic study of quantized reasoning models. They evaluate the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families, ranging from 1.5B to 70B parameters, along with QwQ-32B and Qwen3-8B. The evaluation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks.

The findings show that W8A8 or W4A16 quantization can be lossless, while lower bit-widths introduce significant accuracy risks. The authors also identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. Additionally, strategically scaling model sizes or reasoning steps can effectively enhance performance. All quantized models and code used in the study are openly available on GitHub at https://github.com/ruikangliu/Quantized-Reasoning-Models.
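To give a sense of what the weight bit-width means across the 1.5B to 70B parameter range mentioned above, the following back-of-the-envelope sketch estimates weight storage at 16, 8, and 4 bits. The function name is hypothetical, and the estimate deliberately ignores quantization scales, zero-points, activations, and the KV cache, so real footprints will be somewhat larger.

```python
def weight_memory_gb(num_params, weight_bits):
    """Approximate weight storage in GB (10^9 bytes).

    Assumes every parameter is stored at `weight_bits` bits and ignores
    per-group scales, zero-points, and any runtime overhead.
    """
    return num_params * weight_bits / 8 / 1e9


# Smallest and largest model sizes named in the summary.
for name, params in [("1.5B", 1.5e9), ("70B", 70e9)]:
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{name} model, {bits}-bit weights: about {gb:.2f} GB")
```

Under these assumptions, moving a 70B model from 16-bit to 4-bit weights cuts weight storage from roughly 140 GB to roughly 35 GB, which is the kind of inference-cost saving that motivates asking whether accuracy survives.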
Created on 22 Aug. 2025

