Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

AI-generated keywords: Artificial General Intelligence; Multi-modal Large Language Models; Video-MME; Benchmark; Performance Evaluation

AI-generated Key Points

⚠ The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

- Multi-modal Large Language Models (MLLMs) are a key focus in the pursuit of artificial general intelligence.
- Progress has concentrated on static image understanding, while the processing of sequential visual data remains insufficiently explored.
- Video-MME is introduced as the first full-spectrum, multi-modal evaluation benchmark for MLLMs in video analysis.
- It has four distinguishing features: diversity in video types, duration in the temporal dimension, breadth in data modalities, and quality in annotations.
- The videos span six primary visual domains with 30 subfields to ensure broad scenario generalizability.
- The dataset consists of 900 videos totaling 256 hours, with 2,700 question-answer pairs for model evaluation.
- Gemini 1.5 Pro is identified as the best-performing commercial model, but the results point to a need for improvement in handling longer sequences and multi-modal data.

Authors: Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun

arXiv: 2405.21075v1 (cs.CV)

Project Page: https://video-mme.github.io

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: In the quest for artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a focal point in recent advancements. However, the predominant focus remains on developing their capabilities in static image understanding. The potential of MLLMs in processing sequential visual data is still insufficiently explored, highlighting the absence of a comprehensive, high-quality assessment of their performance. In this paper, we introduce Video-MME, the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis. Our work distinguishes from existing benchmarks through four key features: 1) Diversity in video types, spanning 6 primary visual domains with 30 subfields to ensure broad scenario generalizability; 2) Duration in temporal dimension, encompassing both short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, for robust contextual dynamics; 3) Breadth in data modalities, integrating multi-modal inputs besides video frames, including subtitles and audios, to unveil the all-round capabilities of MLLMs; 4) Quality in annotations, utilizing rigorous manual labeling by expert annotators to facilitate precise and reliable model assessment. 900 videos with a total of 256 hours are manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models. Our dataset along with these findings underscores the need for further improvements in handling longer sequences and multi-modal data. Project Page: https://video-mme.github.io
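The dataset figures quoted in the abstract imply a few simple averages. The short Python sketch below derives them; the variable names are illustrative, and the results are averages only, since the abstract does not state that every video has the same number of questions or a similar length.

```python
# Figures taken from the abstract.
NUM_VIDEOS = 900
TOTAL_HOURS = 256
NUM_QA_PAIRS = 2_700

# Averages derived from those figures (not stated in the abstract itself).
qa_pairs_per_video = NUM_QA_PAIRS / NUM_VIDEOS
mean_minutes_per_video = TOTAL_HOURS * 60 / NUM_VIDEOS

print(f"QA pairs per video (average): {qa_pairs_per_video:.1f}")          # 3.0
print(f"Mean video length: {mean_minutes_per_video:.1f} minutes")         # 17.1
```

On average this is three question-answer pairs per video and roughly 17 minutes per video, which sits between the stated extremes of 11 seconds and 1 hour.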

Submitted to arXiv on 31 May 2024

Results of the summarizing process for the arXiv paper: 2405.21075v1

Comprehensive Summary

In the pursuit of artificial general intelligence, Multi-modal Large Language Models (MLLMs) have emerged as a key focus in recent advancements. Significant progress has been made in enhancing their capabilities in static image understanding. However, there remains a notable gap in exploring their potential for processing sequential visual data. This deficiency underscores the need for a comprehensive and high-quality evaluation of MLLMs' performance in this domain. To address this gap, we present Video-MME, the first-ever full-spectrum Multi-Modal Evaluation benchmark designed specifically for assessing MLLMs in video analysis. Setting itself apart from existing benchmarks, Video-MME offers four distinctive features: diversity in video types, duration in temporal dimension, breadth in data modalities, and quality in annotations. Encompassing six primary visual domains with 30 subfields to ensure broad scenario generalizability, Video-MME covers short-, medium-, and long-term videos ranging from 11 seconds to 1 hour to capture robust contextual dynamics. It also integrates multi-modal inputs beyond video frames such as subtitles and audios to showcase the all-round capabilities of MLLMs. Employing rigorous manual labeling by expert annotators enables precise and reliable model assessment. The dataset for Video-MME consists of 900 videos totaling 256 hours that were manually selected and annotated through repeated viewing to generate 2,700 question-answer pairs. This extensive dataset allows for thorough evaluation of various state-of-the-art MLLMs such as the GPT-4 series and Gemini 1.5 Pro alongside open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. Our experiments reveal that Gemini 1.5 Pro stands out as the best-performing commercial model with significant superiority over open-source alternatives. 
However, these findings also highlight the ongoing need for further enhancements in handling longer sequences and multi-modal data within MLLMs. Through Video-MME's meticulous evaluation process and comprehensive dataset, we aim to drive advancements in leveraging Multi-modal Large Language Models for enhanced performance in video analysis tasks. For more information on this benchmark project, please visit our Project Page at https://video-mme.github.io.
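Because the benchmark groups videos into short, medium, and long durations, one natural way to report results is accuracy per duration group. The Python sketch below illustrates that idea only. The record fields (`duration`, `predicted`, `answer`), the letter-style answers, and the use of plain accuracy are all assumptions made for illustration; the metadata summarized here does not specify the answer format or the scoring metric, and the toy records are made up rather than taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_bucket(records):
    """Return {bucket: fraction correct} for an iterable of result records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        bucket = rec["duration"]  # hypothetical field: "short", "medium" or "long"
        total[bucket] += 1
        if rec["predicted"] == rec["answer"]:
            correct[bucket] += 1
    return {bucket: correct[bucket] / total[bucket] for bucket in total}


# Toy, made-up records for illustration only (not benchmark results).
toy_results = [
    {"duration": "short", "predicted": "A", "answer": "A"},
    {"duration": "short", "predicted": "B", "answer": "C"},
    {"duration": "long", "predicted": "D", "answer": "D"},
]

print(accuracy_by_bucket(toy_results))  # {'short': 0.5, 'long': 1.0}
```

Reporting the groups separately, rather than a single pooled score, makes it easier to see whether a model degrades on longer sequences, which is the weakness the summary highlights.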

Layman's Summary

Summary

- Multi-modal Large Language Models (MLLMs) are like smart robots that can understand and talk in many different ways.
- These smart robots have gotten better at understanding pictures, but they still need to improve at understanding videos, which show things happening one after another.
- Video-MME is a special test to see how well the smart robots can understand videos.
- Video-MME has all kinds of videos of different types and lengths, along with carefully written questions and answers, to test the smart robots.
- The videos come from six main categories, so the test covers many different situations.

Definitions

- Multi-modal Large Language Models (MLLMs): Smart robots that can understand and communicate using different types of information.
- Sequential: Happening in a particular order, one after another.
- Benchmark: A standard or test used to measure how well something performs compared to others.
- Modalities: Different forms or types of data or information.
- Annotations: Notes or explanations added to something to provide more details or context.

Blog article

In recent years, there has been growing interest in the development of artificial general intelligence (AGI). One key focus in this pursuit is Multi-modal Large Language Models (MLLMs), which have shown significant progress in static image understanding. However, their potential for processing sequential visual data remains largely unexplored. To address this gap, researchers have developed Video-MME, described as the first full-spectrum Multi-Modal Evaluation benchmark designed specifically for assessing MLLMs in video analysis.

Video-MME offers four distinctive features that set it apart from existing benchmarks. Firstly, it provides diversity in video types, spanning six primary visual domains with 30 subfields to ensure broad scenario generalizability. Secondly, it covers short-, medium-, and long-term videos, ranging from 11 seconds to 1 hour, to capture robust contextual dynamics. Thirdly, it incorporates multi-modal inputs beyond video frames, including subtitles and audio, to showcase the all-round capabilities of MLLMs. Lastly, it employs rigorous manual labeling by expert annotators to ensure precise and reliable model assessment.

The dataset consists of 900 videos totaling 256 hours, manually selected and annotated through repeated viewing to produce 2,700 question-answer pairs. It is used to evaluate various state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video. In these experiments, Gemini 1.5 Pro is the best-performing commercial model and significantly outperforms the open-source models.

However, the findings also highlight the need for further improvements in handling longer sequences and multi-modal data. By providing a standardized benchmark, Video-MME lets researchers compare the capabilities of different MLLMs and identify areas for improvement, which should in turn contribute to the development of more advanced AGI systems.

For those interested in using Video-MME, the project page (https://video-mme.github.io) provides detailed information on the benchmark, including instructions on how to access and use the dataset. The website also includes a leaderboard showcasing the top-performing models on Video-MME.

In conclusion, Video-MME is a tool for evaluating how well MLLMs process sequential visual data. Its dataset and evaluation process provide insight into these models' strengths and weaknesses, and benchmarks of this kind help drive progress in video understanding.

Created on 24 Nov. 2025

Similar papers summarized with our AI tools

- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models (cs.CV, 78.4%)
- MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition (cs.CV, 63.1%)
- A Survey on Multimodal Large Language Models (cs.CV, 63.0%)
- MHMS: Multimodal Hierarchical Multimedia Summarization (cs.CV, 61.9%)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (cs.CV, 61.0%)
- Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and La… (cs.CV, 59.4%)
- LLaVA-Critic: Learning to Evaluate Multimodal Models (cs.CV, 59.2%)

