FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference

AI-generated keywords: Large Language Model

AI-generated Key Points

  • Large Language Model (LLM) inference efficiency improvement through quantization
  • Introduction of Fine-Grained Mixed Precision (FGMP) quantization, a post-training hardware-software co-design methodology
  • Development of a policy using perturbation and Fisher information to determine precision levels for weight and activation blocks
  • Proposal of sensitivity-weighted clipping approach for fine-grained quantization
  • Hardware augmentations to capture the efficiency benefits, including datapath support for FGMP at block granularity and a mixed-precision activation quantization unit
  • Less than 1% perplexity degradation on Wikitext-103 for the Llama-2-7B model, relative to an all-FP8 baseline design
  • 14% less energy during inference and 30% less weight memory than the same all-FP8 baseline
  • Analysis of the percentage of blocks retained in high precision across different layers, including fully connected layers such as FC2
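
To make the weight-memory figure easier to reason about, the following is a back-of-the-envelope sketch of average storage cost per weight under a block-wise FP8/NVFP4 mix. It is not the paper's accounting. The assumptions are: NVFP4 stores 4-bit elements with one 8-bit scale per 16-element block, the FP8 baseline costs a plain 8 bits per element, and one hypothetical tag bit per block records the chosen precision. All function names are illustrative.

```python
def nvfp4_bits_per_element(block_size=16, scale_bits=8):
    """Average bits per element for 4-bit values plus one shared block scale."""
    return 4 + scale_bits / block_size


def fgmp_bits_per_element(high_fraction, high_bits=8.0, block_size=16, tag_bits=1):
    """Average bits per element when `high_fraction` of blocks stay in high precision.

    `tag_bits` is an assumed per-block flag marking the precision of that block.
    """
    low_bits = nvfp4_bits_per_element(block_size)
    tag_overhead = tag_bits / block_size
    return high_fraction * high_bits + (1.0 - high_fraction) * low_bits + tag_overhead


def savings_vs_fp8(high_fraction, **kwargs):
    """Fractional weight-memory reduction relative to 8 bits per element."""
    return 1.0 - fgmp_bits_per_element(high_fraction, **kwargs) / 8.0


if __name__ == "__main__":
    for frac in (0.0, 0.1, 0.3, 0.5):
        print(f"high-precision fraction {frac:.1f}: savings {savings_vs_fp8(frac):.3f}")
```

Under these assumptions the savings shrink linearly as more blocks are kept in FP8, so the reported reduction depends directly on how many blocks the selection policy retains in high precision.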

Authors: Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Sophia Shao, Brucek Khailany

License: CC BY 4.0

Abstract: Quantization is a powerful tool to improve large language model (LLM) inference efficiency by utilizing more energy-efficient low-precision datapaths and reducing memory footprint. However, accurately quantizing LLM weights and activations to low precision is challenging without degrading model accuracy. We propose fine-grained mixed precision (FGMP) quantization, a post-training mixed-precision quantization hardware-software co-design methodology that maintains accuracy while quantizing the majority of weights and activations to reduced precision. Our work makes the following contributions: 1) We develop a policy that uses the perturbation in each value, weighted by the Fisher information, to select which weight and activation blocks to keep in higher precision. This approach preserves accuracy by identifying which weight and activation blocks need to be retained in higher precision to minimize the perturbation in the model loss. 2) We also propose a sensitivity-weighted clipping approach for fine-grained quantization which helps retain accuracy for blocks that are quantized to low precision. 3) We then propose hardware augmentations to leverage the efficiency benefits of FGMP quantization. Our hardware implementation encompasses i) datapath support for FGMP at block granularity, and ii) a mixed-precision activation quantization unit to assign activation blocks to high or low precision on the fly with minimal runtime and energy overhead. Our design, prototyped using NVFP4 (an FP4 format with microscaling) as the low-precision datatype and FP8 as the high-precision datatype, facilitates efficient FGMP quantization, attaining <1% perplexity degradation on Wikitext-103 for the Llama-2-7B model relative to an all-FP8 baseline design while consuming 14% less energy during inference and requiring 30% less weight memory.
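
The selection policy in contribution 1 can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the diagonal Fisher information is supplied by the caller (for example, as averaged squared gradients), that the error of the high-precision format is negligible, and that the per-block scale is kept in full precision rather than quantized. The names `quantize_block_fp4`, `block_sensitivity`, and `select_high_precision`, and the `keep_fraction` default, are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float; the sign is applied separately.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_fp4(block):
    """Scale a block by its max magnitude and round each value to the FP4 grid."""
    amax = max(abs(v) for v in block)
    scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
    out = []
    for v in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def block_sensitivity(block, fisher_diag):
    """Fisher-weighted sum of squared perturbations caused by low-precision rounding."""
    quantized = quantize_block_fp4(block)
    return sum(f * (q - v) ** 2 for v, q, f in zip(block, quantized, fisher_diag))


def select_high_precision(values, fisher_diag, block_size=16, keep_fraction=0.25):
    """Return sorted indices of the blocks with the largest estimated loss impact."""
    n_blocks = len(values) // block_size
    scores = []
    for b in range(n_blocks):
        lo, hi = b * block_size, (b + 1) * block_size
        scores.append(block_sensitivity(values[lo:hi], fisher_diag[lo:hi]))
    n_keep = round(keep_fraction * n_blocks)
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    vals = [6.0, 2.5, 6.0, 2.75]
    # With uniform sensitivity the first block has the larger rounding error.
    print(select_high_precision(vals, [1.0] * 4, block_size=2, keep_fraction=0.5))
    # Raising the Fisher weights of the second block changes the choice.
    print(select_high_precision(vals, [1.0, 1.0, 10.0, 10.0], block_size=2, keep_fraction=0.5))
```

The point of the weighting is visible in the usage example: a block with a smaller raw rounding error can still be the one worth keeping in high precision if the loss is more sensitive to its values.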

Submitted to arXiv on 19 Apr. 2025

Results of the summarizing process for the arXiv paper: 2504.14152v1

Quantization can improve Large Language Model (LLM) inference efficiency by using more energy-efficient low-precision datapaths and by reducing memory footprint. However, quantizing LLM weights and activations to low precision without degrading model accuracy remains a challenge. To address this, the authors introduce Fine-Grained Mixed Precision (FGMP) quantization, a post-training hardware-software co-design methodology that maintains accuracy while quantizing the majority of weights and activations to reduced precision. The work makes three contributions: 1) A policy that uses the perturbation in each value, weighted by the Fisher information, to determine which weight and activation blocks should be retained in higher precision. This preserves accuracy by identifying the blocks that need higher precision in order to minimize the perturbation in the model loss. 2) A sensitivity-weighted clipping approach for fine-grained quantization, which helps retain accuracy for the blocks that are quantized to low precision. 3) Hardware augmentations that capture the efficiency benefits of FGMP quantization: datapath support for FGMP at block granularity, and a mixed-precision activation quantization unit that assigns activation blocks to high or low precision on the fly with minimal runtime and energy overhead. The design is prototyped with NVFP4 (an FP4 format with microscaling) as the low-precision datatype and FP8 as the high-precision datatype. It achieves less than 1% perplexity degradation on Wikitext-103 for the Llama-2-7B model relative to an all-FP8 baseline design, while consuming 14% less energy during inference and requiring 30% less weight memory.
The analysis also examines the percentage of blocks retained in high precision across different layers, including fully connected layers such as FC2, to show where higher precision is retained.
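
One way to picture the sensitivity-weighted clipping idea is a search over candidate block scales that minimizes a sensitivity-weighted rounding error instead of always scaling to the block maximum. The sketch below swaps in a simple grid search as an assumption; the source does not describe the actual procedure. The candidate `ratios`, the function names, and the use of an unquantized scale are all hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float; the sign is applied separately.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize(block, scale):
    """Round each value to the nearest scaled grid point; large values saturate."""
    out = []
    for v in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def weighted_error(block, sensitivity, scale):
    """Sensitivity-weighted squared error of quantizing `block` with `scale`."""
    quantized = quantize(block, scale)
    return sum(s * (q - v) ** 2 for v, q, s in zip(block, quantized, sensitivity))


def best_clipped_scale(block, sensitivity, ratios=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Pick the clipping ratio whose scale gives the lowest weighted error."""
    amax = max(abs(v) for v in block)
    if amax == 0:
        return 1.0
    candidates = [r * amax / FP4_GRID[-1] for r in ratios]
    return min(candidates, key=lambda s: weighted_error(block, sensitivity, s))


if __name__ == "__main__":
    block = [6.0, 0.8, 0.8, 0.8]
    sensitivity = [0.001, 1.0, 1.0, 1.0]  # the outlier barely affects the loss
    scale = best_clipped_scale(block, sensitivity)
    print(scale, weighted_error(block, sensitivity, scale))
```

In the usage example the outlier has low sensitivity, so clipping it frees resolution for the three values that matter more. Because the ratio 1.0 is always among the candidates, the chosen scale is never worse than max-based scaling under this error measure.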
Created on 20 Aug. 2025

