RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

AI-generated keywords: Transformer-based Large Language Models

AI-generated Key Points

Transformer-based Large Language Models heavily rely on KV cache for extended context handling during decoding
RocketKV introduced as a training-free KV cache compression strategy
Consists of coarse-grain KV cache eviction using SnapKV++ and fine-grain top-k sparse attention through hybrid attention method
RocketKV reduces memory bandwidth and storage demands during decoding while maintaining accuracy comparable to full KV cache attention
Up to 3x end-to-end speedup and up to 31% peak memory reduction on NVIDIA H100 GPU compared to full KV cache baseline
Negligible accuracy loss across various long-context tasks with RocketKV
Varying kernel sizes under RULER with different sequence lengths lead to linear improvements in speedup and peak memory savings
Ablation studies confirm efficacy of RocketKV, highlighting superiority of SnapKV with grouped-query attention enhancement over original SnapKV for performance with low token budgets

Authors: Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov

arXiv: 2502.14051v1 (cs.CL)

License: CC BY 4.0

Abstract: Transformer-based Large Language Models rely critically on KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy designed specifically to reduce both memory bandwidth and capacity demand of KV cache during the decode phase. RocketKV contains two consecutive stages. In the first stage, it performs coarse-grain KV cache eviction on the input sequence tokens with SnapKV++, a method improved upon SnapKV by introducing adaptive pooling size and full compatibility with grouped-query attention. In the second stage, it adopts a hybrid attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensional reductions. Combining these two stages, RocketKV achieves significant KV cache fetching bandwidth and storage savings while maintaining comparable accuracy to full KV cache attention. We show that RocketKV provides end-to-end speedup by up to 3$\times$ as well as peak memory reduction by up to 31% in the decode phase on an NVIDIA H100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks.

Submitted to arXiv on 19 Feb. 2025

Comprehensive Summary

Transformer-based Large Language Models heavily rely on KV cache to efficiently handle extended contexts during the decoding phase. However, the size of the KV cache grows with the input length, causing memory bandwidth and capacity issues as decoding progresses. To tackle this challenge, RocketKV is introduced as a training-free KV cache compression strategy. It consists of two stages: coarse-grain KV cache eviction using SnapKV++ in the first stage and fine-grain top-k sparse attention through a hybrid attention method in the second stage. RocketKV effectively reduces memory bandwidth and storage demands of the KV cache during decoding while maintaining accuracy comparable to full KV cache attention. The approach delivers up to 3x end-to-end speedup and up to 31% peak memory reduction on an NVIDIA H100 GPU compared to the full KV cache baseline.

Notably, RocketKV shows negligible accuracy loss across various long-context tasks. Furthermore, experiments demonstrate that varying kernel sizes under RULER with different sequence lengths lead to linear improvements in both speedup and peak memory savings. Ablation studies confirm the efficacy of RocketKV, particularly highlighting the superiority of SnapKV with grouped-query attention enhancement over the original SnapKV in terms of performance with low token budgets. In conclusion, RocketKV presents a novel solution for accelerating long-context LLM inference by efficiently compressing KV cache without sacrificing accuracy, showcasing significant improvements in speedup and memory efficiency compared to traditional methods.

Layman's Summary

- Big language models built on Transformers need a special memory system called KV cache to help them keep track of more words when they are talking or writing.
- RocketKV is a new way to make this memory system smaller without needing extra training, so the big models can work faster and use less computer memory.
- RocketKV uses two methods to decide which words to keep in its memory: one method is like throwing away big groups of words, and the other method is like focusing on only the most important words.
- With RocketKV, the big language models can still work almost as well as before but with less strain on their memory systems, making them faster and more efficient.
- Tests show that RocketKV can make these big language models run up to three times faster and use up to 31% less computer memory compared to how they usually work.

Definitions

- Transformer-based Large Language Models: Big computer programs that understand and generate human-like text using a specific technology called Transformers.
- KV cache: A special type of memory storage used by these big language models to remember important information about words during conversations or writing.
- Decoding: The process of generating text output from these language models based on the input they receive.
- Memory bandwidth: How quickly data can be moved between a computer's memory and the parts that do the computing.
- Storage demands: The space needed in a computer's memory or storage devices for saving data.

Introduction

Transformer-based Large Language Models (LLMs) have shown remarkable performance in various natural language processing tasks, such as machine translation, text summarization, and question-answering. These models heavily rely on Key-Value (KV) cache to efficiently handle extended contexts during the decoding phase. However, as the input length increases, so does the size of the KV cache. This strains memory bandwidth and capacity during decoding, leading to slower inference times and higher resource requirements. To address this issue, a research paper titled "RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression" proposes a training-free KV cache compression strategy called RocketKV. The approach aims to reduce memory bandwidth and storage demands of the KV cache while maintaining accuracy comparable to full KV cache attention.

The Challenge of Extended Contexts in LLMs

The self-attention mechanism at the core of the Transformer architecture makes long sequences expensive: as the sequence length increases, so do the computational cost and memory requirements of attending to all tokens in each layer. To avoid redundant work during decoding, LLMs use a technique called key-value caching. At each decoding step, the new token attends over all previous tokens using a query-key-value attention mechanism, and the keys and values of those previous tokens are stored in a cache so that they do not have to be recomputed at every step. This keeps the per-step computation manageable, but it comes at a price: the size of the KV cache grows in proportion to the input length, since the keys and values of every token must be stored. The growing cache increases memory bandwidth usage, because it is read at each decoding step, and can cause out-of-memory errors on GPUs with limited resources.
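To make this growth concrete, the following minimal sketch estimates KV cache size as a function of sequence length. The model shape used here (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an illustrative assumption, not a configuration taken from the paper, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2,
                   batch_size: int = 1) -> int:
    """Bytes needed to hold the keys and values of every cached token."""
    # Factor 2: one key vector and one value vector per token, layer and KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token


if __name__ == "__main__":
    # Assumed, illustrative model shape (not from the paper).
    layers, kv_heads, head_dim = 32, 8, 128
    for n in (4096, 32768, 131072):
        gib = kv_cache_bytes(n, layers, kv_heads, head_dim) / 2**30
        print(f"{n:>7} tokens -> {gib:.2f} GiB")
```

Under these assumed dimensions each token costs 128 KiB, so the cache grows linearly from 0.5 GiB at 4,096 tokens to 16 GiB at 131,072 tokens, all of which full attention has to read at every decoding step.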

The Solution: RocketKV

RocketKV presents a novel solution for accelerating long-context LLM inference by efficiently compressing the KV cache with negligible accuracy loss. The approach consists of two consecutive stages: coarse-grain KV cache eviction using SnapKV++ in the first stage and fine-grain top-k sparse attention through a hybrid attention method in the second stage.

Stage 1: Coarse-Grain KV Cache Eviction with SnapKV++

SnapKV++, an improved version of the original SnapKV, is used for coarse-grain KV cache eviction on the input sequence tokens. According to the abstract, it improves on SnapKV in two ways: it introduces an adaptive pooling size, and it is fully compatible with grouped-query attention (GQA), in which several query heads share one set of keys and values. Evicting tokens reduces the number of keys and values that must be stored and fetched, lowering both memory capacity and memory bandwidth demand.
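The general idea behind SnapKV-style eviction can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's SnapKV++: it scores each prompt token by the attention it receives from the last few prompt queries (an observation window), smooths the scores with max pooling, and keeps the highest-scoring tokens. Summing votes across all query heads of one KV group is one plausible way to handle grouped-query attention and is an assumption here, as are the fixed pooling kernel and the helper names `max_pool_1d` and `select_kept_tokens`.

```python
from typing import List


def max_pool_1d(x: List[float], kernel: int) -> List[float]:
    """Max pooling with stride 1 and 'same' padding (kernel assumed odd)."""
    half = kernel // 2
    return [max(x[max(0, i - half): i + half + 1]) for i in range(len(x))]


def select_kept_tokens(attn: List[List[List[float]]], budget: int,
                       kernel: int = 3) -> List[int]:
    """Pick which prefix tokens one KV head keeps.

    attn[h][q][t] is the attention weight from observation-window query q of
    query head h to prefix token t; all heads h share the same KV head.
    Returns the sorted indices of the kept tokens.
    """
    num_tokens = len(attn[0][0])
    votes = [0.0] * num_tokens
    for head in attn:              # assumption: sum votes over the GQA group
        for row in head:           # and over the observation-window queries
            for t, weight in enumerate(row):
                votes[t] += weight
    # Pooling spreads a high score to its neighbours so that selected tokens
    # tend to form contiguous clusters rather than isolated positions.
    pooled = max_pool_1d(votes, kernel)
    ranked = sorted(range(num_tokens), key=lambda t: pooled[t], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Two query heads in one group, one window query each, six prefix tokens.
    attn = [
        [[0.05, 0.60, 0.05, 0.10, 0.10, 0.10]],
        [[0.10, 0.10, 0.10, 0.10, 0.50, 0.10]],
    ]
    print(select_kept_tokens(attn, budget=2, kernel=1))
    print(select_kept_tokens(attn, budget=3, kernel=3))
```

With a kernel of 1 the two individually strongest tokens are kept, while a kernel of 3 keeps the strongest token together with its neighbours. The choice of pooling size therefore changes which tokens survive, which is the knob the abstract says SnapKV++ makes adaptive.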

Stage 2: Fine-Grain Top-K Sparse Attention

In this stage, a hybrid attention method performs fine-grain top-k sparse attention on the tokens that remain after the first stage. Instead of attending over all of those tokens, each decoding step attends only to the top-k most relevant ones. The relevant tokens are identified by approximating the attention scores, leveraging reductions along both the head and sequence dimensions. This further cuts KV cache fetching bandwidth while still capturing important information from extended contexts.
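The top-k step itself can be illustrated with a minimal single-head sketch. Note the simplification: the paper selects tokens from approximated attention scores, whereas this sketch ranks tokens by their exact scores, so it shows only the sparse-attention step and not the approximation that saves bandwidth. The function name `topk_sparse_attention` and its return shape are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def topk_sparse_attention(q: Sequence[float],
                          keys: Sequence[Sequence[float]],
                          values: Sequence[Sequence[float]],
                          k: int) -> Tuple[List[float], List[int]]:
    """One decoding step of attention restricted to the k best-scoring tokens.

    Returns the attention output and the sorted indices of the tokens used.
    """
    k = max(1, min(k, len(keys)))
    scale = 1.0 / math.sqrt(len(q))
    # Simplification: exact scaled dot-product scores are used for ranking.
    scores = [scale * sum(a * b for a, b in zip(q, key)) for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    # Softmax over the selected tokens only (max subtracted for stability).
    peak = max(scores[i] for i in top)
    weights = [math.exp(scores[i] - peak) for i in top]
    total = sum(weights)

    out = [0.0] * len(values[0])
    for weight, i in zip(weights, top):
        for d, v in enumerate(values[i]):
            out[d] += (weight / total) * v
    return out, sorted(top)


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[10.0, 0.0], [0.0, 1.0], [9.0, 0.0], [-5.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    output, used = topk_sparse_attention(q, keys, values, k=2)
    print(used, output)
```

Only the values of the selected tokens are read when forming the output, which is where the fetch savings come from once the selection itself is made cheap.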

Results and Impact

RocketKV was evaluated on a variety of long-context tasks. The experiments showed that RocketKV delivers up to 3x end-to-end speedup and up to 31% peak memory reduction in the decode phase on an NVIDIA H100 GPU compared to the full KV cache baseline. Notably, RocketKV shows negligible accuracy loss across different tasks and varying kernel sizes under RULER (a benchmark suite for long-context LLMs). Ablation studies also confirmed the effectiveness of RocketKV, particularly highlighting the superiority of SnapKV with grouped-query attention enhancement over the original SnapKV in terms of performance with low token budgets. The impact of this research paper lies in its ability to significantly improve inference times and resource efficiency for LLMs, which are crucial for real-world applications. Because the approach is training-free, it should in principle carry over to other Transformer-based LLMs that keep a KV cache during decoding, although the abstract does not report results beyond the authors' own experiments.

Conclusion

In conclusion, RocketKV presents a novel solution for accelerating long-context LLM inference by efficiently compressing the KV cache with negligible accuracy loss. By combining coarse-grain eviction with fine-grain top-k sparse attention, RocketKV effectively reduces memory bandwidth and storage demands of the KV cache during decoding while maintaining comparable accuracy to full KV cache attention. This research has significant implications for improving the efficiency and scalability of LLMs in various natural language processing tasks.

Created on 24 Jun. 2026
