Kimi Linear: An Expressive, Efficient Attention Architecture

AI-generated keywords: Kimi Linear, attention architecture, hybrid linear attention, Kimi Delta Attention (KDA), Gated DeltaNet

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Kimi Team introduces Kimi Linear, a hybrid linear attention architecture that outperforms full attention under fair comparisons in short-context, long-context, and reinforcement learning (RL) scaling regimes
  • Its core module, Kimi Delta Attention (KDA), extends Gated DeltaNet with a finer-grained gating mechanism that makes better use of limited finite-state RNN memory
  • A custom chunkwise algorithm achieves high hardware efficiency through a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation relative to the general DPLR formulation
  • The pretrained Kimi Linear model has 3 billion activated parameters and 48 billion total parameters and uses a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA)
  • With an identical training recipe, Kimi Linear outperforms full MLA across all evaluated tasks while reducing KV cache usage by up to 75% and reaching up to 6 times the decoding throughput at a 1M context
  • Kimi Linear can serve as a drop-in replacement for full attention architectures, including on tasks with longer input and output lengths
  • The team open-sources the KDA kernel and vLLM implementations and releases pre-trained and instruction-tuned model checkpoints
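
The 75% KV cache figure in the key points can be sanity-checked with simple arithmetic. The sketch below assumes that only full-attention layers keep a cache that grows with context length, while linear-attention layers keep a constant-size state that is ignored here. The 3:1 interleaving ratio is an illustrative assumption, chosen because it reproduces the quoted number; it is not stated in the metadata shown on this page, and `kv_cache_reduction` is a hypothetical helper name.

```python
def kv_cache_reduction(linear_per_full: int) -> float:
    """Fraction of the per-token KV cache removed by a layerwise hybrid.

    Assumptions (illustrative, not taken from the paper metadata):
      - every full-attention layer has the same per-token cache cost;
      - linear-attention layers keep only a fixed-size state, treated as 0
        relative to a cache that grows with context length.
    """
    if linear_per_full < 0:
        raise ValueError("linear_per_full must be non-negative")
    layers_per_group = linear_per_full + 1  # N linear layers + 1 full layer
    return linear_per_full / layers_per_group


# Hypothetical 3:1 hybrid: 3 of every 4 layers no longer store per-token KV.
reduction = kv_cache_reduction(3)
print(f"KV cache reduced by {reduction:.0%}")
```

Under these assumptions the saving depends only on the interleaving ratio, which is why it is quoted as an upper bound ("up to"): at short contexts the fixed-size linear states are no longer negligible.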

Authors: Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du

Kimi Linear tech report
License: CC BY-NC-ND 4.0

Abstract: We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
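
To make the abstract's description of a gated delta rule more concrete, here is a minimal pure-Python sketch of one recurrent step over a fixed-size state matrix. It follows a common way of writing the gated delta-rule update, S_t = (I - beta k k^T) Diag(alpha) S_{t-1} + beta k v^T; the exact parameterization used in the paper may differ, and this token-by-token loop is not the chunkwise algorithm the abstract refers to. The function names and arguments are hypothetical. Passing the same value for every entry of `alpha` corresponds to coarse scalar gating, while distinct entries illustrate per-channel ("finer-grained") gating.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent update of a d_k x d_v memory matrix S (list of rows).

    k: key vector (length d_k), v: value vector (length d_v)
    alpha: per-key-channel decay gates in [0, 1] (length d_k)
    beta: scalar write strength in [0, 1]
    """
    d_k, d_v = len(k), len(v)
    # 1) Forget: decay each key channel (row) of the state by its own gate.
    decayed = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read: what the decayed memory currently returns for this key.
    pred = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta rule: correct the memory toward v along direction k.
    return [
        [decayed[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
        for i in range(d_k)
    ]


def read(S, q):
    """Query the memory: returns S^T q as a list of length d_v."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]


# The state stays 2 x 2 no matter how many tokens are processed.
S = [[0.0, 0.0], [0.0, 0.0]]
S = gated_delta_step(S, k=[1.0, 0.0], v=[3.0, 4.0], alpha=[1.0, 1.0], beta=1.0)
S = gated_delta_step(S, k=[1.0, 0.0], v=[5.0, 6.0], alpha=[0.5, 1.0], beta=1.0)
print(read(S, [1.0, 0.0]))
```

With a unit-norm key and `beta = 1.0`, the second write fully replaces the value stored under that key, which is the error-correcting behavior that distinguishes a delta rule from purely additive linear attention.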

Submitted to arXiv on 30 Oct. 2025

Results of the summarizing process for the arXiv paper: 2510.26692v2

The Kimi Team introduces Kimi Linear, a hybrid linear attention architecture that surpasses full attention models under fair comparisons across short-context, long-context, and reinforcement learning (RL) scaling regimes. At the heart of Kimi Linear is Kimi Delta Attention (KDA), a linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. The team's custom chunkwise algorithm achieves high hardware efficiency by employing a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. The team pretrains a Kimi Linear model with 3 billion activated parameters and 48 billion total parameters, built as a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). With an identical training recipe, Kimi Linear outperforms full MLA by a substantial margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times the decoding throughput at a 1M context. These results indicate that Kimi Linear can serve as a drop-in replacement for full attention architectures, with better performance and efficiency even on tasks with longer input and output lengths. To support further research, the team has open-sourced the KDA kernel and vLLM implementations and released pre-trained and instruction-tuned model checkpoints.
Created on 12 Nov. 2025
