Kimi Linear: An Expressive, Efficient Attention Architecture

AI-generated keywords: Kimi Linear; attention architecture; hybrid linear attention; Kimi Delta Attention (KDA); Gated DeltaNet

AI-generated Key Points

⚠ The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

- The Kimi Team introduces Kimi Linear, a hybrid linear attention architecture reported to outperform full attention under fair comparisons in short-context, long-context, and RL scaling regimes
- Its core module, Kimi Delta Attention (KDA), extends Gated DeltaNet with a finer-grained gating mechanism
- A custom chunkwise algorithm achieves high hardware efficiency by using a specialized variant of DPLR transition matrices that needs less computation than the general DPLR formulation
- The pretrained Kimi Linear model has 3 billion activated parameters and 48 billion total parameters and uses a layerwise hybrid of KDA and MLA
- With an identical training recipe, it outperforms full MLA across all evaluated tasks, reduces KV cache usage by up to 75%, and achieves up to six times the decoding throughput at a 1M context
- Kimi Linear is presented as a drop-in replacement for full attention architectures, including on tasks with longer input and output lengths
- The team has open-sourced the KDA kernel and vLLM implementations and released pre-trained and instruction-tuned model checkpoints


Authors: Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, Wentao Li, Enzhe Lu, Weizhou Liu, Yanru Chen, Weixin Xu, Longhui Yu, Yejie Wang, Yu Fan, Longguang Zhong, Enming Yuan, Dehao Zhang, Yizhi Zhang, T. Y. Liu, Haiming Wang, Shengjun Fang, Weiran He, Shaowei Liu, Yiwei Li, Jianlin Su, Jiezhong Qiu, Bo Pang, Junjie Yan, Zhejun Jiang, Weixiao Huang, Bohong Yin, Jiacheng You, Chu Wei, Zhengtao Wang, Chao Hong, Yutian Chen, Guanduo Chen, Yucheng Wang, Huabin Zheng, Feng Wang, Yibo Liu, Mengnan Dong, Zheng Zhang, Siyuan Pan, Wenhao Wu, Yuhao Wu, Longyu Guan, Jiawen Tao, Guohong Fu, Xinran Xu, Yuzhi Wang, Guokun Lai, Yuxin Wu, Xinyu Zhou, Zhilin Yang, Yulun Du

arXiv: 2510.26692v2 (cs.CL)

Kimi Linear tech report

License: CC BY-NC-ND 4.0

Abstract: We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.

Submitted to arXiv on 30 Oct. 2025


Results of the summarizing process for the arXiv paper: 2510.26692v2

⚠ This paper's license doesn't allow us to build upon its content, so the summaries here are made from the paper's metadata rather than the full article.

Comprehensive Summary

The Kimi Team introduces Kimi Linear, a hybrid linear attention architecture that, according to the authors, outperforms full attention under fair comparisons for the first time. The comparison covers short-context, long-context, and reinforcement learning (RL) scaling regimes. At the heart of Kimi Linear is Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism. This enables more effective use of limited finite-state RNN memory. The team's custom chunkwise algorithm achieves high hardware efficiency by employing a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices. This substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. The team pretrains a Kimi Linear model with 3 billion activated parameters and 48 billion total parameters, built as a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Their experiments show that, with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks. Additionally, it reduces KV cache usage by up to 75% and achieves up to six times the decoding throughput at a 1M context. These results suggest that Kimi Linear can serve as a drop-in replacement for full attention architectures, with superior performance and efficiency even on tasks with longer input and output lengths. To support further research, the team has open-sourced the KDA kernel and vLLM implementations and released pre-trained and instruction-tuned model checkpoints.
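The "up to 75%" KV cache figure can be made concrete with a little arithmetic. The sketch below is an illustration only: it assumes that linear attention layers keep a fixed-size state instead of a per-token cache, and that one layer in every four is a full-attention (MLA) layer. The 32-layer depth, the one-in-four ratio, and the function name `kv_cache_entries` are hypothetical values chosen to reproduce the reported percentage; this page does not state the actual layer configuration.

```python
def kv_cache_entries(num_layers: int, context_len: int, full_every: int = 1) -> int:
    """Count per-token cache slots, one per caching layer per token.

    full_every=1 means every layer is full attention and caches every token.
    full_every=4 means only one layer in four keeps a per-token cache
    (an assumed ratio, not taken from the paper metadata).
    """
    caching_layers = num_layers // full_every
    return caching_layers * context_len


# Hypothetical 32-layer model at a 1M-token context.
full = kv_cache_entries(32, 1_000_000)
hybrid = kv_cache_entries(32, 1_000_000, full_every=4)
reduction = 1 - hybrid / full
print(f"{reduction:.0%}")  # prints 75%
```

Under these assumptions the saving does not depend on context length, because the fixed-size state of the linear layers is ignored; in practice that state adds a small constant per layer.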

Layman's Summary

Summary
- The Kimi Team created a new model called Kimi Linear that, in the authors' tests, does better than standard full attention models in several situations.
- Kimi Linear has a special part called Kimi Delta Attention (KDA), a linear attention module that helps it use its limited memory well.
- A special algorithm in Kimi Linear makes it use less computer power and work faster.
- The Kimi Linear model has many parameters (48 billion in total, 3 billion active at a time) and works well across different tasks.
- Kimi Linear can replace full attention models and do tasks faster and more efficiently.

Definitions
- Model: A computer program that has learned from data how to solve a problem.
- Module: A part of something that does a specific job or task.
- Algorithm: A set of steps to follow to solve a problem or complete a task.
- Parameters: Values inside a model that affect how it works or behaves.
- Efficiently: Doing something well without wasting time or resources.

Blog article

Attention mechanisms have become an integral part of many state-of-the-art machine learning models, particularly in natural language processing (NLP). They allow a model to focus on specific parts of the input, improving its performance. However, full attention is costly on long sequences: its compute grows quadratically with sequence length, and its key-value (KV) cache grows linearly.

This is where the tech report by the Kimi Team comes in. Titled "Kimi Linear: An Expressive, Efficient Attention Architecture," it introduces a hybrid linear attention architecture that the authors report outperforms full attention under fair comparisons in short-context, long-context, and reinforcement learning scaling regimes, while also being more hardware-efficient.

At the core of Kimi Linear is Kimi Delta Attention (KDA), a linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism. This allows more effective use of the limited finite-state recurrent neural network (RNN) memory. In simpler terms, the model can process longer sequences without requiring excessive computational resources.

To further improve hardware efficiency, the team developed a custom chunkwise algorithm that uses a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices. This substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule, a long-established learning rule for neural networks.

The team pretrained a Kimi Linear model with 3 billion activated parameters and 48 billion total parameters using a layerwise hybrid of Kimi Delta Attention (KDA) and Multi-Head Latent Attention (MLA). In their experiments, with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks.

One key advantage of Kimi Linear is its ability to handle extended input and output lengths efficiently. It reduces KV cache usage by up to 75% and achieves up to six times the decoding throughput at a 1M context. This makes it a promising candidate for tasks that require processing long sequences, such as long-document summarization.

To support further research, the team has open-sourced the KDA kernel and vLLM implementations and released pre-trained and instruction-tuned model checkpoints.

In conclusion, the Kimi Team's tech report introduces Kimi Linear, an attention architecture that the authors report is both stronger and more efficient than full attention under the same training recipe. Its ability to handle longer sequences without compromising performance makes it a promising option for various NLP tasks, and the open-source release makes it easier for others to build on. The authors position Kimi Linear as a drop-in replacement for full attention architectures.
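To make the idea of a gated delta rule with finer-grained gating more concrete, here is a minimal recurrent sketch in plain Python. It is an illustration under stated assumptions, not the released KDA kernel (which the abstract describes as a chunkwise algorithm): it assumes the fixed-size memory is a key-by-value matrix, that the gate `alpha` holds one decay factor per key channel, and that `beta` is a scalar write strength. The function name `gated_delta_step` and all variable names are hypothetical. Setting every entry of `alpha` to the same value gives a coarser, single-scalar gate.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta-rule memory (illustrative sketch).

    S     : d_k x d_v memory matrix (list of lists)
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : per-channel decay factors in [0, 1], length d_k (assumed gate shape)
    beta  : scalar write strength in [0, 1]
    """
    d_k, d_v = len(S), len(S[0])
    # 1) Forget: decay each key channel (row) of the memory by its own gate.
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Delta rule: read what the memory currently predicts for this key,
    #    then write back only the prediction error, scaled by beta.
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    S = [[S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
         for i in range(d_k)]
    # 3) Read out with the query.
    o = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, o


state = [[0.0, 0.0], [0.0, 0.0]]
state, out = gated_delta_step(state, q=[1.0, 0.0], k=[1.0, 0.0],
                              v=[3.0, 4.0], alpha=[1.0, 1.0], beta=1.0)
print(out)  # [3.0, 4.0]
# Writing a new value under the same unit-norm key with beta=1 overwrites the old one.
state, out = gated_delta_step(state, q=[1.0, 0.0], k=[1.0, 0.0],
                              v=[5.0, 6.0], alpha=[0.5, 1.0], beta=1.0)
print(out)  # [5.0, 6.0]
```

The memory stays the same size no matter how many tokens are processed, which is what distinguishes this family of models from full attention with its growing KV cache.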

Created on 12 Nov. 2025


Similar papers summarized with our AI tools

- 55.2%: XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference (cs.CL)
- 54.4%: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-… (cs.CL)
- 54.4%: Attention Is All You Need (cs.CL)
- 54.0%: System 2 Attention (is something you might need too) (cs.CL)
- 53.4%: Linearizing Transformer with Key-Value Memory Bank (cs.CL)
- 53.4%: Attention in Large Language Models Yields Efficient Zero-Shot Re-Rankers (cs.CL)
- 53.1%: Ring Attention with Blockwise Transformers for Near-Infinite Context (cs.CL)


Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.