KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
AI-generated Key Points
- Efficiently serving large language models (LLMs) requires batching of many requests to reduce cost per request
- With larger batch sizes and longer context lengths, the key-value (KV) cache becomes the bottleneck in both speed and memory usage
- Quantization reduces the KV cache size by storing each element in fewer bits
- KIVI is a tuning-free 2bit KV cache quantization algorithm derived from an analysis of the element distribution in popular LLMs' KV caches; it quantizes the key cache per-channel and the value cache per-token
- With a hardware-friendly implementation, KIVI lets Llama, Falcon, and Mistral models keep almost the same quality while using 2.6× less peak memory (including model weights)
- The memory reduction enables up to 4× larger batch sizes and 2.35× to 3.47× higher throughput on real LLM inference workloads
- On the long-context generation tasks in LongBench, applying KIVI to various models has minimal impact on accuracy
- Needle-in-a-haystack (NIAH) results show that KIVI maintains retrieval ability even with a 2bit KV cache
- Ablation studies on GSM8K analyze how the hyperparameters group size G and residual length R affect model performance
- Group size significantly affects KV cache compression effectiveness under long inputs; a reasonably large residual length is crucial for performance on difficult tasks like GSM8K
- Efficiency comparisons indicate that KIVI compresses the KV cache with little loss in model quality across the LLMs and tasks tested
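To make the memory argument in the key points concrete, here is a rough sketch of how KV cache size scales with batch size, context length, and bits per element. The model shape (32 layers, 32 KV heads, head dimension 128) and the function name `kv_cache_bytes` are illustrative assumptions, not values taken from the paper. The estimate ignores quantization metadata (per-group scales and zero points) and any tokens kept in full precision.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bits):
    """Bytes for keys plus values, ignoring quantization metadata."""
    # Factor 2 accounts for storing both a key and a value per token.
    elements = 2 * batch * seq_len * n_layers * n_kv_heads * head_dim
    return elements * bits // 8


# Assumed 7B-class shape, chosen only for illustration.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

fp16 = kv_cache_bytes(batch=8, seq_len=4096, bits=16, **shape)
int2 = kv_cache_bytes(batch=8, seq_len=4096, bits=2, **shape)

print(fp16 / 2**30, int2 / 2**30)  # 16.0 2.0 (GiB)
```

The raw 8× ratio between 16-bit and 2-bit storage is an upper bound on the cache alone. The 2.6× peak-memory figure reported in the abstract is lower because it also counts model weights.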
Authors: Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, Xia Hu
Abstract: Efficiently serving large language models (LLMs) requires batching of many requests to reduce the cost per request. Yet, with larger batch sizes and longer context lengths, the key-value (KV) cache, which stores attention keys and values to avoid re-computations, significantly increases memory demands and becomes the new bottleneck in speed and memory usage. Additionally, the loading of the KV cache causes the computational core to be idle, which limits the inference speed. A straightforward and effective solution to reduce KV cache size is quantization, which decreases the total bytes taken by KV cache. However, there is a lack of in-depth studies that explore the element distribution of KV cache to understand the hardness and limitation of KV cache quantization. To fill the gap, we conducted a comprehensive study on the element distribution in KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., group elements along the channel dimension and quantize them together. In contrast, the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2bit KV cache quantization algorithm named KIVI. With hardware-friendly implementation, KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory (including model weight). This reduction in memory usage enables up to $\mathbf{4\times}$ larger batch size, bringing $\mathbf{2.35\times \sim 3.47\times}$ throughput on real LLM inference workload. The source code is available at https://github.com/jy-yuan/KIVI.
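The per-channel versus per-token distinction in the abstract can be illustrated with a minimal pure-Python sketch. It is not the KIVI implementation: the helper names are hypothetical, the grouping covers the whole row or column rather than groups of size G, and the toy matrix with one large-magnitude channel is an assumption chosen to make the difference visible. Rows are tokens and columns are channels.

```python
def quantize_group(xs, bits=2):
    """Asymmetric min/max quantization of one group of floats."""
    levels = (1 << bits) - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
    codes = [round((x - lo) / scale) for x in xs]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    return [c * scale + zero for c in codes]


def fake_quant_rows(m, bits=2):
    """Quantize then dequantize each row as its own group."""
    return [dequantize_group(*quantize_group(row, bits)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def per_token(m, bits=2):
    # Each token (row) shares one scale and zero point.
    return fake_quant_rows(m, bits)


def per_channel(m, bits=2):
    # Each channel (column) shares one scale and zero point.
    return transpose(fake_quant_rows(transpose(m), bits))


def max_abs_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Toy "key" matrix: 4 tokens x 3 channels, last channel has large magnitude.
keys = [
    [0.1, -0.2, 8.0],
    [0.3, 0.0, 9.0],
    [-0.1, 0.2, 7.5],
    [0.2, -0.3, 8.5],
]

err_channel = max_abs_error(keys, per_channel(keys))
err_token = max_abs_error(keys, per_token(keys))
print(err_channel < err_token)  # True
```

In this toy case, grouping along the channel dimension confines the large-magnitude channel to its own quantization range, so the small-valued channels are not forced onto a coarse grid. Per-token grouping mixes the large channel into every row's range and raises the error.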