KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

AI-generated keywords: Large language models, Batching, KV cache bottleneck, Quantization, KIVI algorithm

AI-generated Key Points

  • Serving large language models (LLMs) efficiently requires batching many requests to reduce the cost per request
  • With larger batch sizes and longer context lengths, the key-value (KV) cache, which stores attention keys and values to avoid recomputation, becomes the bottleneck in both speed and memory usage
  • Quantization shrinks the KV cache by reducing the total number of bytes it occupies
  • KIVI is a tuning-free 2bit KV cache quantization algorithm derived from an analysis of the element distribution in the KV caches of popular LLMs: the key cache is quantized per-channel and the value cache per-token
  • With a hardware-friendly implementation, Llama, Falcon, and Mistral models keep almost the same quality while using 2.6× less peak memory (including model weights)
  • The memory savings enable up to 4× larger batch sizes, bringing 2.35× to 3.47× throughput on real LLM inference workloads
  • On LongBench, applying KIVI to various models has minimal impact on accuracy across challenging long-context generation tasks
  • Needle-in-a-haystack (NIAH) results show that KIVI maintains retrieval ability even with a 2bit KV cache
  • Ablation studies on GSM8K analyze how the hyperparameters group size G and residual length R affect model performance
  • Group size significantly affects how effectively the KV cache is compressed under long inputs; a reasonably large residual length is crucial for performance gains on difficult tasks such as GSM8K
  • The efficiency comparison shows that KIVI compresses the KV cache effectively without sacrificing model quality across various LLMs and tasks
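
To make the memory pressure described in the points above concrete, the following sketch estimates the raw size of a KV cache. It is a back-of-the-envelope calculation, not code from the paper: the function name `kv_cache_bytes` and the model shape (32 layers, 32 KV heads, head dimension 128) are illustrative assumptions, and the 2-bit figure ignores the scales and zero points that a real quantizer also has to store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bits_per_element=16):
    """Bytes needed to hold keys and values for every layer, head, and token."""
    # The leading factor 2 accounts for storing both a key and a value tensor.
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bits_per_element // 8


# Illustrative model shape (an assumption, not taken from the paper).
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

fp16_per_token = kv_cache_bytes(**shape, seq_len=1, batch_size=1)
fp16_batch = kv_cache_bytes(**shape, seq_len=4096, batch_size=16)
two_bit_batch = kv_cache_bytes(**shape, seq_len=4096, batch_size=16,
                               bits_per_element=2)

print(f"fp16, one token:            {fp16_per_token / 2**20} MiB")
print(f"fp16, batch 16 x 4096:      {fp16_batch / 2**30} GiB")
print(f"2-bit codes, batch 16 x 4096: {two_bit_batch / 2**30} GiB")
```

The cache grows linearly in both batch size and context length, which is why it overtakes the model weights as the dominant memory cost once many long requests are batched together.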

Authors: Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu

ICML2024
License: CC BY 4.0

Abstract: Efficiently serving large language models (LLMs) requires batching of many requests to reduce the cost per request. Yet, with larger batch sizes and longer context lengths, the key-value (KV) cache, which stores attention keys and values to avoid re-computations, significantly increases memory demands and becomes the new bottleneck in speed and memory usage. Additionally, the loading of the KV cache causes the computational core to be idle, which limits the inference speed. A straightforward and effective solution to reduce KV cache size is quantization, which decreases the total bytes taken by KV cache. However, there is a lack of in-depth studies that explore the element distribution of KV cache to understand the hardness and limitation of KV cache quantization. To fill the gap, we conducted a comprehensive study on the element distribution in KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm named KIVI. With hardware-friendly implementation, KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory (including model weight). This reduction in memory usage enables up to $\mathbf{4\times}$ larger batch size, bringing $\mathbf{2.35\times \sim 3.47\times}$ throughput on real LLM inference workload. The source code is available at https://github.com/jy-yuan/KIVI.
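
The per-channel versus per-token distinction in the abstract can be illustrated with a small NumPy sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `quantize_asymmetric` and `dequantize` are hypothetical, the tensors are random toy data with one artificially enlarged key channel, the 2-bit codes are kept in `uint8` rather than bit-packed, and the grouping and residual mechanisms of KIVI are omitted.

```python
import numpy as np


def quantize_asymmetric(x, axis, bits=2):
    """Asymmetric uniform quantization.

    The minimum (zero point) and scale are computed along `axis`, so every
    slice along that axis shares one (scale, zero point) pair.
    """
    levels = 2 ** bits - 1  # 3 for 2-bit codes in {0, 1, 2, 3}
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    scale = (x_max - x_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant slices
    codes = np.clip(np.round((x - x_min) / scale), 0, levels).astype(np.uint8)
    return codes, scale, x_min


def dequantize(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point values."""
    return codes.astype(np.float32) * scale + zero_point


rng = np.random.default_rng(0)
num_tokens, num_channels = 64, 8
keys = rng.standard_normal((num_tokens, num_channels)).astype(np.float32)
keys[:, 3] *= 20.0  # hypothetical large-magnitude channel in the toy keys
values = rng.standard_normal((num_tokens, num_channels)).astype(np.float32)

# Keys, per-channel: statistics are taken over the token axis (axis=0),
# giving one scale and zero point per channel.
k_codes, k_scale, k_zero = quantize_asymmetric(keys, axis=0)

# Values, per-token: statistics are taken over the channel axis (axis=1),
# giving one scale and zero point per token.
v_codes, v_scale, v_zero = quantize_asymmetric(values, axis=1)

k_hat = dequantize(k_codes, k_scale, k_zero)
v_hat = dequantize(v_codes, v_scale, v_zero)

print("key scale shape:", k_scale.shape, "value scale shape:", v_scale.shape)
print("per-channel key scales:", k_scale.ravel())
print("mean abs key error:", float(np.abs(k_hat - keys).mean()))
print("mean abs value error:", float(np.abs(v_hat - values).mean()))
```

In this toy setup, per-channel quantization gives the enlarged key channel its own scale, so its wide range does not coarsen the quantization grid of the other channels.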

Submitted to arXiv on 05 Feb. 2024

Results of the summarizing process for the arXiv paper: 2402.02750v2

Efficiently serving large language models (LLMs) requires batching many requests to reduce the cost per request. However, as batch sizes and context lengths grow, the key-value (KV) cache becomes a bottleneck in both speed and memory usage. Quantization addresses this by reducing the total number of bytes the KV cache occupies. We analyzed the element distribution in the KV caches of popular LLMs and found that the key cache should be quantized per-channel and the value cache per-token. Based on these findings, we developed KIVI, a tuning-free 2bit KV cache quantization algorithm. With a hardware-friendly implementation, models such as Llama, Falcon, and Mistral maintain almost the same quality while using 2.6× less peak memory (including model weights). This reduction enables up to 4× larger batch sizes, bringing 2.35× to 3.47× throughput on real LLM inference workloads.

On LongBench, applying KIVI to various models had minimal impact on accuracy across challenging long-context generation tasks. Our needle-in-a-haystack (NIAH) results further show that KIVI maintains retrieval ability even with a 2bit KV cache. In ablation studies, we benchmarked KIVI on GSM8K to analyze the effect of the hyperparameters group size G and residual length R on model performance. The choice of group size significantly affects how effectively the KV cache is compressed under long inputs. Although there is no consistent pattern between residual length and model accuracy, a reasonably large residual length is crucial for performance gains on difficult tasks such as GSM8K.

Overall, our efficiency comparison shows that KIVI compresses the KV cache effectively without sacrificing model quality across various LLMs and tasks. The source code for KIVI is available at https://github.com/jy-yuan/KIVI.
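
One way to see why group size G and residual length R matter for compression is to estimate the average number of bits stored per cached element. The sketch below is arithmetic under assumptions that this summary does not spell out: each group of G quantized elements is assumed to carry one fp16 scale and one fp16 zero point (32 metadata bits), and the R most recent tokens are assumed to stay in fp16. The function name `average_bits_per_element` is hypothetical.

```python
def average_bits_per_element(seq_len, group_size, residual_length,
                             quant_bits=2, full_bits=16, metadata_bits=32):
    """Average storage cost per cached element under the stated assumptions."""
    # Assumption: up to `residual_length` tokens are kept in full precision.
    full_tokens = min(residual_length, seq_len)
    quant_tokens = seq_len - full_tokens
    # Assumption: one scale and one zero point are amortized over each group.
    quant_cost = quant_bits + metadata_bits / group_size
    return (quant_tokens * quant_cost + full_tokens * full_bits) / seq_len


for group_size in (32, 64, 128):
    bits = average_bits_per_element(seq_len=4096, group_size=group_size,
                                    residual_length=128)
    print(f"G={group_size:>3}, R=128, 4096 tokens: {bits:.3f} bits per element")

for seq_len in (256, 4096, 32768):
    bits = average_bits_per_element(seq_len=seq_len, group_size=32,
                                    residual_length=128)
    print(f"G= 32, R=128, {seq_len:>5} tokens: {bits:.3f} bits per element")
```

Under these assumptions, the full-precision residual is amortized away as the input grows, while the per-group metadata is a fixed overhead on every quantized element, so the group size sets the floor on the achievable compression for long inputs.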
Created on 03 Sep. 2025
