Kimi Linear: An Expressive, Efficient Attention Architecture
AI-generated keywords:
Kimi Linear
attention architecture
hybrid linear attention
Kimi Delta Attention (KDA)
Gated DeltaNet
Authors:
Kimi Team,
Yu Zhang,
Zongyu Lin,
Xingcheng Yao,
Jiaxi Hu,
Fanqing Meng,
Chengyin Liu,
Xin Men,
Songlin Yang,
Zhiyuan Li,
Wentao Li,
Enzhe Lu,
Weizhou Liu,
Yanru Chen,
Weixin Xu,
Longhui Yu,
Yejie Wang,
Yu Fan,
Longguang Zhong,
Enming Yuan,
Dehao Zhang,
Yizhi Zhang,
T. Y. Liu,
Haiming Wang,
Shengjun Fang,
Weiran He,
Shaowei Liu,
Yiwei Li,
Jianlin Su,
Jiezhong Qiu,
Bo Pang,
Junjie Yan,
Zhejun Jiang,
Weixiao Huang,
Bohong Yin,
Jiacheng You,
Chu Wei,
Zhengtao Wang,
Chao Hong,
Yutian Chen,
Guanduo Chen,
Yucheng Wang,
Huabin Zheng,
Feng Wang,
Yibo Liu,
Mengnan Dong,
Zheng Zhang,
Siyuan Pan,
Wenhao Wu,
Yuhao Wu,
Longyu Guan,
Jiawen Tao,
Guohong Fu,
Xinran Xu,
Yuzhi Wang,
Guokun Lai,
Yuxin Wu,
Xinyu Zhou,
Zhilin Yang,
Yulun Du
Abstract: We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA with a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6 times decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
Submitted to arXiv on 30 Oct. 2025
- Comprehensive Summary
- Key points
- Layman's Summary
- Blog article
The Kimi Team introduces Kimi Linear, a hybrid linear attention architecture that surpasses full attention models in various scenarios, including short-context, long-context, and reinforcement learning (RL) scaling regimes. At the heart of Kimi Linear is Kimi Delta Attention (KDA), a linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism. This enables more effective use of the limited finite-state RNN memory. The team's custom chunkwise algorithm achieves high hardware efficiency by employing a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices. This substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. The team pretrains a Kimi Linear model with 3 billion activated parameters and 48 billion total parameters, built as a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). With an identical training recipe, they demonstrate that Kimi Linear outperforms full MLA by a substantial margin across all assessed tasks. Additionally, it decreases KV cache usage by up to 75% and achieves up to six times the decoding throughput at a context length of 1 million tokens. These results indicate that Kimi Linear can serve as a drop-in replacement for full attention architectures, offering superior performance and efficiency even in tasks with extended input and output lengths. To facilitate further research, the team has open-sourced the KDA kernel and vLLM implementations and released pre-trained and instruction-tuned model checkpoints.
- Kimi Team introduces Kimi Linear, a groundbreaking model surpassing full attention models in various scenarios
- Kimi Linear features an innovative linear attention module enhancing Gated DeltaNet with a more intricate gating mechanism
- Custom chunkwise algorithm ensures high hardware efficiency by reducing computational requirements with specialized DPLR transition matrices
- Pretrained Kimi Linear model with 3 billion activated parameters and 48 billion total parameters using layerwise fusion of KDA and MLA
- Demonstrated superior performance across all assessed tasks, decreasing KV cache usage by up to 75% and achieving up to six times decoding throughput for a context size of 1 million
- Kimi Linear shows potential as a seamless replacement for full attention architectures, offering superior performance and efficiency even in tasks with extended input and output lengths
- The team has made KDA kernel and vLLM implementations open-source, providing access to pre-trained and instruction-tuned model checkpoints to facilitate further research
Summary
- Kimi Team created a new model called Kimi Linear that does better than standard full attention models in several different situations.
- Kimi Linear has a special module, called Kimi Delta Attention (KDA), that helps it make better use of its limited memory.
- A special algorithm in Kimi Linear makes it use less computer power and work faster.
- The Kimi Linear model has 48 billion parameters, of which 3 billion are active at a time, and works well across the tasks it was tested on.
- Kimi Linear can replace full attention models while using up to 75% less cache memory and producing output up to six times faster on very long inputs.
Definitions
- Model: A computer program that has learned from data how to carry out a task, such as writing text.
- Module: A part of something that does a specific job or task.
- Algorithm: A set of steps to follow to solve a problem or complete a task.
- Parameters: The numbers inside a model that are adjusted during training and determine how it behaves.
- Efficiently: Doing something well without wasting time or resources.
Attention mechanisms have become an integral part of many state-of-the-art machine learning models, particularly in natural language processing (NLP) tasks. These mechanisms allow the model to focus on specific parts of the input data, improving its performance. However, traditional full attention has limitations when handling long sequences: its compute grows quadratically with sequence length, and its key-value (KV) cache grows linearly. This is where the research paper by the Kimi Team comes into play.
Titled "Kimi Linear: An Expressive, Efficient Attention Architecture," this paper introduces a hybrid linear attention architecture that surpasses full attention models in short-context, long-context, and reinforcement learning (RL) scaling regimes. The team's approach, called Kimi Linear, not only outperforms full attention under fair comparisons but also offers improved hardware efficiency.
At the core of Kimi Linear is Kimi Delta Attention (KDA), a linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism. This allows for more effective use of the limited finite-state recurrent neural network (RNN) memory. In simpler terms, the model keeps a fixed-size memory instead of a cache that grows with every token, so it can process longer sequences without requiring excessive computational resources.
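To make the idea of a gated, fixed-size memory concrete, here is a minimal sketch of one recurrent step of a generic gated delta rule. It is not the released KDA kernel: the function name, the argument names (`alpha` for the forget gate, `beta` for the write strength), and the exact update form are assumptions chosen for illustration. The per-channel `alpha` stands in for a finer-grained gate; using the same value for every channel would correspond to a coarser, scalar gate.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a gated delta-rule memory (illustrative sketch only).

    S     : d_k x d_v state matrix (list of lists), fixed size for any sequence length
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : per-channel forget gate, length d_k, each entry in [0, 1] (assumed form)
    beta  : scalar write strength in [0, 1] (assumed form)
    """
    d_k, d_v = len(S), len(S[0])
    # 1) Decay: each key channel of the memory is forgotten at its own rate.
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read what the decayed memory currently returns for this key.
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Delta rule: move the stored value for this key toward v.
    S = [[S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
         for i in range(d_k)]
    # 4) Output for the query, read from the updated state.
    out = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, out


# Usage: write a value under a key, then overwrite it with a new one.
state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
state, out = gated_delta_step(state, key, key, [1.0, 2.0], [1.0, 1.0], 1.0)
state, out = gated_delta_step(state, key, key, [5.0, 6.0], [1.0, 1.0], 1.0)
```

The state keeps the same shape no matter how many tokens are processed, which is the property that distinguishes this kind of memory from a KV cache that grows with the sequence.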
To further improve hardware efficiency, the team developed a custom chunkwise algorithm that uses a specialized variant of Diagonal-Plus-Low-Rank (DPLR) transition matrices. This substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule, a widely used learning rule in neural networks.
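The following worked equation sketches why a gated delta-rule update has a DPLR transition matrix. The symbols are illustrative and not taken from this summary: $S_t$ is the memory state, $\alpha_t$ a per-channel decay vector, $\beta_t$ a scalar write strength, and $k_t$, $v_t$ the key and value. The update form is the generic gated delta rule with a per-channel gate, assumed here for illustration; the paper's exact parameterization may differ.

```latex
\begin{aligned}
S_t &= \bigl(I - \beta_t k_t k_t^{\top}\bigr)\,\mathrm{Diag}(\alpha_t)\,S_{t-1} + \beta_t k_t v_t^{\top} \\
    &= \bigl(\mathrm{Diag}(\alpha_t) - \beta_t k_t (\alpha_t \odot k_t)^{\top}\bigr)\,S_{t-1} + \beta_t k_t v_t^{\top}
\end{aligned}
```

The second line follows from $k_t^{\top}\mathrm{Diag}(\alpha_t) = (\alpha_t \odot k_t)^{\top}$. The transition matrix is a diagonal matrix plus a rank-one term, so it is DPLR. A general DPLR transition $\mathrm{Diag}(\alpha_t) - a_t b_t^{\top}$ has two free vectors $a_t$ and $b_t$; in this sketch both are tied to the same key $k_t$, which is the kind of extra structure a specialized chunkwise algorithm can exploit.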
The team pre-trained their Kimi Linear model with 3 billion activated parameters and 48 billion total parameters using a layerwise hybrid of Kimi Delta Attention (KDA) and Multi-Head Latent Attention (MLA). Through rigorous experimentation with an identical training recipe, they demonstrated that Kimi Linear outperforms full MLA by a substantial margin across all assessed tasks.
One key advantage of Kimi Linear is its ability to handle extended input and output lengths efficiently. It decreases KV cache usage by up to 75% and achieves up to six times the decoding throughput at context lengths as large as 1 million tokens. This makes it a promising candidate for tasks that require processing long sequences, such as summarizing long documents.
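A rough way to see where a figure like 75% can come from is to count layers. The sketch below assumes a stack that repeats three linear-attention layers followed by one full-attention layer; that ratio is an assumption made for this example and is not stated in this summary. It also counts only the per-token cache of the full-attention layers and ignores the small fixed-size state of the linear layers, which is one reason the reported saving is phrased as "up to". The function names are hypothetical.

```python
def hybrid_schedule(num_layers, linear_per_full=3):
    """Layer types for a stack that repeats `linear_per_full` linear-attention
    layers followed by one full-attention layer (assumed pattern)."""
    period = linear_per_full + 1
    return ["full" if (i + 1) % period == 0 else "linear"
            for i in range(num_layers)]


def kv_cache_saving(schedule):
    """Fraction of per-token KV cache saved relative to an all-full-attention
    stack of the same depth, counting only layers that keep a growing cache."""
    full_layers = schedule.count("full")
    return 1 - full_layers / len(schedule)


layers = hybrid_schedule(8)
saving = kv_cache_saving(layers)  # 2 of 8 layers keep a KV cache, so 0.75 is saved
```

Under this assumed pattern only one layer in four stores keys and values for every token, so the cache that grows with context length shrinks to a quarter of its full-attention size.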
To facilitate further research in this domain, the team has made their KDA kernel and vLLM implementations open-source. They have also provided access to pre-trained and instruction-tuned model checkpoints. This not only showcases the advancements achieved through Kimi Linear but also sets the stage for continued exploration and innovation in attention architectures.
In conclusion, the Kimi Team's research paper introduces Kimi Linear, a hybrid linear attention architecture that offers superior performance and efficiency compared to traditional full attention models. Its ability to handle longer sequences without compromising performance makes it a promising solution for various NLP tasks. The team's decision to open-source their implementation should make it easier for others to build on this work. With its potential to serve as a drop-in replacement for full attention architectures, Kimi Linear is positioned to make significant contributions to machine learning and artificial intelligence.