Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

AI-generated keywords: Artificial General Intelligence, Multi-modal Large Language Models, Video-MME, Benchmark, Performance Evaluation

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Multi-modal Large Language Models (MLLMs) are a key focus in the pursuit of artificial general intelligence
  • Progress has been made in static image understanding but a gap exists in processing sequential visual data
  • Video-MME is presented as the first-ever full-spectrum benchmark for assessing MLLMs in video analysis
  • Video-MME offers diversity in video types, a wide range of video durations, breadth in data modalities, and high-quality annotations
  • It covers six primary visual domains with 30 subfields to ensure broad scenario generalizability
  • The dataset consists of 900 videos totaling 256 hours, with 2,700 question-answer pairs for model evaluation
  • Gemini 1.5 Pro is identified as the best-performing commercial model, but handling of longer sequences and multi-modal data still needs improvement

Authors: Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun

Project Page: https://video-mme.github.io

Abstract: In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes itself from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audio, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. 900 videos with a total of 256 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset along with these findings underscores the need for further improvements in handling longer sequences and multi-modal data.
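
The dataset figures in the abstract imply some simple per-video averages. The short Python sketch below works them out from the three quoted totals only (900 videos, 256 hours, 2,700 question-answer pairs). The function and variable names are illustrative and do not come from the paper, and the averages say nothing about how durations or questions are actually distributed across videos.

```python
# Totals quoted in the abstract.
NUM_VIDEOS = 900
TOTAL_HOURS = 256
NUM_QA_PAIRS = 2700


def dataset_averages(num_videos, total_hours, num_qa_pairs):
    """Return (QA pairs per video, mean video length in minutes)."""
    qa_per_video = num_qa_pairs / num_videos
    mean_minutes = total_hours * 60 / num_videos
    return qa_per_video, mean_minutes


qa_per_video, mean_minutes = dataset_averages(NUM_VIDEOS, TOTAL_HOURS, NUM_QA_PAIRS)
print(f"{qa_per_video:.1f} QA pairs per video on average")
print(f"{mean_minutes:.1f} minutes per video on average")
```

On these totals the averages come to 3 question-answer pairs per video and roughly 17 minutes per video, which sits between the stated extremes of 11 seconds and 1 hour.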

Submitted to arXiv on 31 May 2024

Results of the summarizing process for the arXiv paper: 2405.21075v1

In the pursuit of artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a key focus of recent work. Significant progress has been made in static image understanding, but their potential for processing sequential visual data remains insufficiently explored. This gap underscores the need for a comprehensive, high-quality evaluation of MLLM performance on video. To address it, we present Video-MME, the first-ever full-spectrum Multi-Modal Evaluation benchmark designed specifically for assessing MLLMs in video analysis.

Video-MME sets itself apart from existing benchmarks through four features: diversity in video types, duration in the temporal dimension, breadth in data modalities, and quality of annotations. It spans six primary visual domains with 30 subfields to ensure broad scenario generalizability, and it covers short-, medium-, and long-term videos ranging from 11 seconds to 1 hour to capture robust contextual dynamics. It also integrates multi-modal inputs beyond video frames, such as subtitles and audio, to reveal the all-round capabilities of MLLMs. Rigorous manual labeling by expert annotators enables precise and reliable model assessment.

The dataset consists of 900 videos totaling 256 hours, manually selected and annotated through repeated viewing to produce 2,700 question-answer pairs. With it, we evaluate various state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, alongside open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model and significantly outperforms the open-source models.

These findings also highlight the need for further improvement in handling longer sequences and multi-modal data. With Video-MME, we aim to drive progress in applying MLLMs to video analysis tasks. For more information on the benchmark, visit the Project Page at https://video-mme.github.io.
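
Because the summary singles out longer sequences as a weak point, one natural way to read benchmark results is to break accuracy down by video duration. The sketch below shows how such a breakdown could be computed. It is an illustration under assumptions, not the authors' evaluation code: the record keys (`duration`, `answer`, `prediction`), the letter-style answers, and the toy records are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records, key="duration"):
    """Compute exact-match accuracy per group.

    `records` is an iterable of dicts with hypothetical keys:
    a group label under `key`, a reference 'answer', and a model 'prediction'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        if record["prediction"] == record["answer"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Toy records invented for illustration; they are not Video-MME data.
toy_records = [
    {"duration": "short", "answer": "A", "prediction": "A"},
    {"duration": "short", "answer": "B", "prediction": "C"},
    {"duration": "long", "answer": "D", "prediction": "D"},
]

scores = accuracy_by_group(toy_records)
for group, score in scores.items():
    print(f"{group}: {score:.2f}")
```

Grouping by a label rather than by raw seconds keeps the sketch independent of where the boundaries between short, medium, and long videos fall, which the metadata summarized here does not state.
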
Created on 24 Nov. 2025
