ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization

AI-generated keywords: Quantized Model Optimization

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Debate over the optimal bit-width for quantized models:
      • Some advocate 4-bit quantization
      • Others argue that 1.58-bit yields superior results
  • Introduction of the ParetoQ framework:
      • Enables comparisons across bit-widths from 1-bit to 4-bit
      • Reveals a notable learning transition between 2 and 3 bits
  • Performance of the ParetoQ framework:
      • Surpasses previous methods tailored to specific bit-widths
      • A 600-million-parameter ternary model outperforms the previous state-of-the-art 3-billion-parameter ternary model in accuracy, using only one-fifth of the parameters
  • Comparison of quantization approaches:
      • Ternary, 2-bit, and 3-bit quantization deliver comparable performance in the size-accuracy trade-off
      • All three generally outperform both 4-bit and binary quantization
  • Promise of 2-bit quantization under hardware constraints:
      • Promising potential for memory reduction and speedup

Authors: Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra

NeurIPS 2025. Model weights are available at https://huggingface.co/collections/facebook/mobilellm-6722be18cb86c20ebe113e95

Abstract: The optimal bit-width for achieving the best trade-off between quantized model size and accuracy has been a subject of ongoing debate. While some advocate for 4-bit quantization, others propose that 1.58-bit offers superior results. However, the lack of a cohesive framework for different bits has left such conclusions relatively tenuous. We present ParetoQ, the first unified framework that facilitates rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. Our findings reveal a notable learning transition between 2 and 3 bits: For 3-bits and above, the fine-tuned models stay close to their original pre-trained distributions, whereas for learning 2-bit networks or below, the representations change drastically. By optimizing training schemes and refining quantization functions, ParetoQ surpasses all previous methods tailored to specific bit widths. Remarkably, our ParetoQ ternary 600M-parameter model even outperforms the previous SoTA ternary 3B-parameter model in accuracy, using only one-fifth of the parameters. Extensive experimentation shows that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off and generally exceeds 4-bit and binary quantization. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
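To make the bit-width labels in the abstract concrete, the following is a minimal sketch of a generic round-to-nearest quantizer. It is not ParetoQ's quantization function, which the abstract does not specify; the names `bits_per_weight` and `quantize_uniform` and the evenly spaced symmetric grid are assumptions made for illustration. The sketch shows why a ternary (three-level) weight is described as 1.58-bit: log2(3) is roughly 1.585.

```python
import math


def bits_per_weight(num_levels: int) -> float:
    """Information content of one weight that takes one of num_levels values."""
    return math.log2(num_levels)


def quantize_uniform(weights, num_levels):
    """Snap each weight to the nearest of num_levels evenly spaced values.

    The grid spans [-m, m] with m = max |w|. This is a generic illustrative
    scheme (an assumption), not the quantizer used in the paper.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    step = 2 * max_abs / (num_levels - 1)
    quantized = []
    for w in weights:
        index = round((w + max_abs) / step)
        index = min(max(index, 0), num_levels - 1)  # keep index on the grid
        quantized.append(-max_abs + index * step)
    return quantized


if __name__ == "__main__":
    weights = [-1.0, -0.4, 0.1, 0.35, 0.9]
    # 2, 3, 4, 8, 16 levels correspond to 1-bit, 1.58-bit, 2-bit, 3-bit, 4-bit.
    for levels in (2, 3, 4, 8, 16):
        approx = quantize_uniform(weights, levels)
        error = sum((w - q) ** 2 for w, q in zip(weights, approx)) / len(weights)
        print(f"{levels:2d} levels, {bits_per_weight(levels):.2f} bits, MSE {error:.4f}")
```

Running the script prints the bits per weight and the mean squared rounding error for each setting on a toy weight list. The numbers only illustrate the grid resolution and say nothing about trained model accuracy.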

Submitted to arXiv on 04 Feb. 2025


Results of the summarizing process for the arXiv paper: 2502.02631v2

The paper's license does not allow us to build upon its content, so this summary was generated from the paper's metadata rather than the full article.

In quantized model optimization, the optimal bit-width for balancing model size and accuracy remains debated. Some advocate 4-bit quantization, while others argue that 1.58-bit yields superior results. However, without a unified framework covering different bit-widths, such conclusions have remained tenuous. ParetoQ is presented as the first unified framework that enables rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. The experiments reveal a notable learning transition between 2 and 3 bits: models fine-tuned at 3 bits or more stay close to their original pre-trained distributions, whereas models trained at 2 bits or fewer undergo drastic representation changes. By optimizing training schemes and refining quantization functions, ParetoQ surpasses all previous methods tailored to specific bit-widths. Notably, the 600-million-parameter ParetoQ ternary model outperforms the previous state-of-the-art 3-billion-parameter ternary model in accuracy while using only one-fifth of the parameters. Extensive experiments further show that ternary, 2-bit, and 3-bit quantization deliver comparable performance in the size-accuracy trade-off and generally outperform both 4-bit and binary quantization. Given hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup. These findings clarify how bit-width affects low-bit LLM quantization and can inform more efficient model optimization strategies.
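The size side of the size-accuracy trade-off can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement from the paper: the helper `weight_storage_mb` is hypothetical, it assumes ideal bit packing, and it ignores quantization scales, embeddings, and any layers kept in higher precision. The parameter counts (600 million and 3 billion) are the ones quoted in the summary.

```python
import math


def weight_storage_mb(num_params: int, bits_per_weight: float) -> float:
    """Idealized weight storage in megabytes (10**6 bytes), assuming perfect packing."""
    return num_params * bits_per_weight / 8 / 1e6


if __name__ == "__main__":
    small, large = 600_000_000, 3_000_000_000  # parameter counts from the summary

    # The 600M model has one-fifth of the parameters of the 3B model.
    print(f"parameter ratio: {small / large:.2f}")

    # Bits per weight for each setting; ternary is log2(3), about 1.58 bits.
    settings = {
        "1-bit": 1.0,
        "1.58-bit": math.log2(3),
        "2-bit": 2.0,
        "3-bit": 3.0,
        "4-bit": 4.0,
    }
    for name, bits in settings.items():
        print(f"{name:>8}: 600M -> {weight_storage_mb(small, bits):7.1f} MB, "
              f"3B -> {weight_storage_mb(large, bits):7.1f} MB")
```

Because storage scales linearly with both parameter count and bits per weight, a smaller model at a given bit-width can occupy the same space as a larger model at a lower one, which is why the comparison is framed as a trade-off curve rather than a single best bit-width.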
Created on 20 Mar. 2026


