Kimi-VL Technical Report

AI-generated keywords: Mystery

AI-generated Key Points

  • Captivating blend of mystery, spirituality, and natural grandeur
  • Opening scene with a person cooking in a dimly lit room creates anticipation
  • Transition to elderly person spinning a prayer wheel evokes themes of resilience and contemplation
  • Serene landscape enhances sense of adventure and natural beauty
  • Close-ups of an eye and prayer wheel hint at personal stories within majestic setting
  • Scenes capture power of crashing waves, serenity beneath the surface, and grandeur of mountain ranges
  • Room filled with candles shows elderly person in deep contemplation, conveying reverence and wisdom
  • Culmination in view of majestic mountain range emphasizes awe-inspiring moments in nature
Also access our AI generated: Comprehensive summary, Lay summary, Blog-like article; or ask questions about this paper to our AI assistant.

Authors: Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haoning Wu, Haotian Yao, Haoyu Lu, Heng Wang, Hongcheng Gao, Huabin Zheng, Jiaming Li, Jianlin Su, Jianzhou Wang, Jiaqi Deng, Jiezhong Qiu, Jin Xie, Jinhong Wang, Jingyuan Liu, Junjie Yan, Kun Ouyang, Liang Chen, Lin Sui, Longhui Yu, Mengfan Dong, Mengnan Dong, Nuo Xu, Pengyu Cheng, Qizheng Gu, Runjie Zhou, Shaowei Liu, Sihan Cao, Tao Yu, Tianhui Song, Tongtong Bai, Wei Song, Weiran He, Weixiao Huang, Weixin Xu, Xiaokun Yuan, Xingcheng Yao, Xingzhe Wu, Xinhao Li, Xinxing Zu, Xinyu Zhou, Xinyuan Wang, Y. Charles, Yan Zhong, Yang Li, Yangyang Hu, Yanru Chen, Yejie Wang, Yibo Liu, Yibo Miao, Yidao Qin, Yimin Chen, Yiping Bao, Yiqin Wang, Yongsheng Kang, Yuanxin Liu, Yuhao Dong, Yulun Du, Yuxin Wu, Yuzhi Wang, Yuzi Yan, Zaida Zhou, Zhaowei Li, Zhejun Jiang, Zheng Zhang, Zhilin Yang, Zhiqi Huang, Zihao Huang, Zijia Zhao, Ziwei Chen, Zongyu Lin

Updated Kimi-VL-A3B-Thinking-2506 information
License: CC BY-NC-SA 4.0

Abstract: We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

Submitted to arXiv on 10 Apr. 2025

Ask questions about this paper to our AI assistant

You can also chat with multiple papers at once here.

AI assistant instructions?

Results of the summarizing process for the arXiv paper: 2504.07491v3

The video unfolds with a captivating blend of mystery, spirituality, and natural grandeur. The opening scene sets the tone with a person cooking in a dimly lit room, creating an atmosphere of anticipation. As the text "THE NORTH FACE PRESENTS" appears on screen, the stage is set for the video's theme. Transitioning to an elderly person spinning a prayer wheel, the focus shifts to their weathered face and intricate details of their yellow jacket. This evokes themes of resilience and contemplation. The serene landscape further enhances the sense of adventure and natural beauty. Close-ups of an eye and a prayer wheel hint at personal stories within this majestic setting. Subsequent scenes capture the power of crashing waves, the serenity beneath the surface, and the grandeur of mountain ranges, emphasizing awe-inspiring moments in nature. Moving indoors to a room filled with candles, an elderly person is shown in deep contemplation while holding a prayer wheel. The intricate details of their attire and surroundings convey a sense of reverence and wisdom. The scene transitions between close-ups of the prayer wheel and the elderly person's face before culminating in a view of the majestic mountain range in the background. Overall, this video masterfully weaves together elements of mystery, spirituality, and natural beauty to create a visually stunning narrative that invites viewers to reflect on life's complexities and marvel at the wonders of the world around them.
Created on 12 May. 2026

Assess the quality of the AI-generated content by voting

Score: 0

Why do we need votes?

Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.

Similar papers summarized with our AI tools

Navigate through even more similar papers through a

tree representation

Look for similar papers (in beta version)

By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.

Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.