RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

AI-generated keywords: Transformer-based Large Language Models

AI-generated Key Points

  • Transformer-based Large Language Models rely heavily on the KV cache to handle extended contexts during decoding
  • RocketKV is a training-free KV cache compression strategy
  • It has two consecutive stages: coarse-grain KV cache eviction with SnapKV++, then fine-grain top-k sparse attention via a hybrid attention method
  • RocketKV reduces the memory bandwidth and storage demands of the KV cache during decoding while keeping accuracy comparable to full KV cache attention
  • Up to 3× end-to-end speedup and up to 31% peak memory reduction in the decode phase on an NVIDIA H100 GPU, compared to the full KV cache baseline
  • Negligible accuracy loss across a variety of long-context tasks
  • Varying kernel sizes under RULER with different sequence lengths lead to linear improvements in speedup and peak memory savings
  • Ablation studies support the efficacy of RocketKV; in particular, SnapKV with the grouped-query attention enhancement outperforms the original SnapKV at low token budgets

Authors: Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov

License: CC BY 4.0

Abstract: Transformer-based Large Language Models rely critically on KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy designed specifically to reduce both memory bandwidth and capacity demand of KV cache during the decode phase. RocketKV contains two consecutive stages. In the first stage, it performs coarse-grain KV cache eviction on the input sequence tokens with SnapKV++, a method improved upon SnapKV by introducing adaptive pooling size and full compatibility with grouped-query attention. In the second stage, it adopts a hybrid attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensional reductions. Combining these two stages, RocketKV achieves significant KV cache fetching bandwidth and storage savings while maintaining comparable accuracy to full KV cache attention. We show that RocketKV provides end-to-end speedup by up to 3$\times$ as well as peak memory reduction by up to 31% in the decode phase on an NVIDIA H100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks.
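
The abstract's point that the KV cache grows in proportion to input length can be made concrete with a back-of-the-envelope calculation. The sketch below uses the standard accounting for a decoder-only Transformer (one key and one value vector per token, per layer, per KV head). The function name `kv_cache_bytes` and the model shape in `cfg` are assumptions chosen for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape (16-bit elements), chosen only for illustration.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **cfg)
print(f"per token: {per_token // 1024} KiB")
for n in (4096, 32768, 131072):
    print(f"{n:>7} tokens: {kv_cache_bytes(n, **cfg) / 2**30:.2f} GiB")
```

Under these assumed values the cache costs 128 KiB per token, so a 131072-token context needs 16 GiB for the cache alone. Every decode step must read that data, which is why both capacity and bandwidth are affected.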

Submitted to arXiv on 19 Feb. 2025

Results of the summarizing process for the arXiv paper: 2502.14051v1

Transformer-based Large Language Models heavily rely on KV cache to efficiently handle extended contexts during the decoding phase. However, the size of the KV cache grows with the input length, causing memory bandwidth and capacity issues as decoding progresses. To tackle this challenge, RocketKV is introduced as a training-free KV cache compression strategy. It consists of two stages: coarse-grain KV cache eviction using SnapKV++ in the first stage and fine-grain top-k sparse attention through a hybrid attention method in the second stage. RocketKV effectively reduces memory bandwidth and storage demands of the KV cache during decoding while maintaining accuracy comparable to full KV cache attention. The approach delivers up to 3x end-to-end speedup and up to 31% peak memory reduction on an NVIDIA H100 GPU compared to the full KV cache baseline.

Notably, RocketKV shows negligible accuracy loss across various long-context tasks. Furthermore, experiments demonstrate that varying kernel sizes under RULER with different sequence lengths lead to linear improvements in both speedup and peak memory savings. Ablation studies confirm the efficacy of RocketKV, particularly highlighting the superiority of SnapKV with grouped-query attention enhancement over the original SnapKV in terms of performance with low token budgets. In conclusion, RocketKV presents a novel solution for accelerating long-context LLM inference by efficiently compressing KV cache without sacrificing accuracy, showcasing significant improvements in speedup and memory efficiency compared to traditional methods.
Created on 24 Jun. 2026
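
The two-stage structure described in the summary can be illustrated with a small single-head toy in plain Python. This is a sketch under stated assumptions, not the authors' SnapKV++ or hybrid attention implementation. Stage 1 here uses a fixed pooling size and scores prompt tokens by the attention they receive from a few trailing "observation" queries. Stage 2 approximates scores using only the `r` largest-magnitude query dimensions, then runs exact attention over the top-k survivors. The names `coarse_evict` and `topk_sparse_attention`, and the parameters `budget`, `pool`, `k`, and `r`, are hypothetical. The adaptive pooling, grouped-query attention handling, and sequence-dimension reduction mentioned in the abstract are not reproduced.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_evict(keys, obs_queries, budget, pool=3):
    """Stage 1 (toy): permanently keep only `budget` prompt tokens."""
    n = len(keys)
    scale = 1.0 / math.sqrt(len(keys[0]))
    votes = [0.0] * n
    # Accumulate the attention each token receives from the observation queries.
    for q in obs_queries:
        probs = softmax([dot(q, key) * scale for key in keys])
        votes = [v + p for v, p in zip(votes, probs)]
    # Max-pool along the sequence so neighbours of important tokens also survive.
    half = pool // 2
    pooled = [max(votes[max(0, i - half): i + half + 1]) for i in range(n)]
    keep = sorted(range(n), key=lambda i: pooled[i], reverse=True)[:budget]
    return sorted(keep)


def topk_sparse_attention(q, keys, values, kept, k, r):
    """Stage 2 (toy): pick top-k retained tokens by approximate score, then attend exactly."""
    # Approximate scores from the r largest-magnitude query dimensions only.
    dims = sorted(range(len(q)), key=lambda d: abs(q[d]), reverse=True)[:r]
    approx = {i: sum(q[d] * keys[i][d] for d in dims) for i in kept}
    chosen = sorted(sorted(kept, key=lambda i: approx[i], reverse=True)[:k])
    # Exact softmax attention restricted to the chosen tokens.
    scale = 1.0 / math.sqrt(len(q))
    probs = softmax([dot(q, keys[i]) * scale for i in chosen])
    out = [sum(p * values[i][d] for p, i in zip(probs, chosen))
           for d in range(len(values[0]))]
    return chosen, out


# Tiny example: 8 prompt tokens, head dimension 4; tokens 2 and 6 carry signal.
keys = [[0.0] * 4 for _ in range(8)]
keys[2] = [4.0, 0.0, 0.0, 0.0]
keys[6] = [0.0, 4.0, 0.0, 0.0]
values = [[float(i), 1.0] for i in range(8)]

kept = coarse_evict(keys, [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], budget=6)
chosen, out = topk_sparse_attention([1.0, 0.0, 0.0, 0.0], keys, values, kept, k=2, r=1)
print(kept, chosen, out)
```

In this toy, stage 1 shrinks what is stored (tokens 0 and 4 are evicted for good), while stage 2 shrinks what is read on each decode step (only `k` of the retained tokens are fetched in full). That mirrors the storage and bandwidth savings the summary attributes to the two stages.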
