InternVideo: General Video Foundation Models via Generative and Discriminative Learning

AI-generated keywords: InternVideo, video foundation model, self-supervised learning, cross-model learning, ethical considerations

AI-generated Key Points

Introduction of a novel video foundation model, InternVideo, combining generative and discriminative self-supervised video learning techniques
Focus on masked video modeling and video-language contrastive learning as pretraining objectives
Achieves state-of-the-art performance on 39 video datasets across various tasks such as action recognition/detection, video-language alignment, and open-world applications
Efficiency in training with significantly less power consumption compared to other models like CoCa
Leveraging a unified video representation based on cross-model learning for generalized spatiotemporal representations
Consistently delivers notable performance improvements even in zero-shot and open-set scenarios
Potential limitations in handling long-term or high-order video tasks like anticipating plots from movie segments
Future work could focus on expanding capabilities to address these challenges and enhance generality of video representation learning
Emphasis on ethical considerations in data collection for training models like InternVideo, ensuring responsible use of curated datasets
Importance of exploring issues related to bias, risks, fairness, equality, and other social topics for comprehensive impact assessment in the field of AI research

Authors: Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao

arXiv: 2212.03191v1 (cs.CV)

technical report

License: CC BY 4.0

Abstract: The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .

Submitted to arXiv on 06 Dec. 2022

Results of the summarizing process for the arXiv paper: 2212.03191v1

Comprehensive Summary
Key points
Layman's Summary
Blog article

In this paper, we introduce InternVideo, a novel video foundation model that combines generative and discriminative self-supervised video learning techniques to enhance video-level understanding tasks. Unlike existing vision foundation models that primarily focus on image-level pretraining and adaptation, InternVideo explores masked video modeling and video-language contrastive learning as pretraining objectives. By selectively coordinating the representations from these two frameworks in a learnable manner, InternVideo achieves state-of-the-art performance on 39 video datasets across various tasks such as action recognition/detection, video-language alignment, and open-world applications. InternVideo stands out for its efficiency in training, consuming significantly less power compared to other models like CoCa. The model leverages a unified video representation based on cross-model learning, leading to generalized spatiotemporal representations with impressive results across different datasets. Even in zero-shot and open-set scenarios, InternVideo consistently delivers notable performance improvements. While InternVideo excels in current popular video perception tasks using clips, it may face limitations when handling long-term or high-order video tasks like anticipating plots from movie segments. Future work could focus on expanding the model's capabilities to address these challenges and further enhance the generality of video representation learning. Moreover, ethical considerations are paramount in data collection for training models like InternVideo. Queries used for gathering data are carefully vetted for ethical and legal compliance to ensure responsible use of curated datasets. As research progresses in the field of video understanding, it is essential to explore issues related to bias, risks, fairness, equality, and other social topics for a comprehensive impact assessment.
Overall, InternVideo represents a significant advancement in the realm of video foundation models by achieving superior performance across diverse tasks while maintaining efficiency in training. Its success underscores the potential for further innovation in enhancing video understanding capabilities and addressing broader societal implications within the field of AI research.

Summary

1. A new video model called InternVideo was introduced, combining different ways to learn from videos.
2. It focuses on specific types of learning tasks before the main training starts.
3. This model performs very well on many different video tasks and uses less power than other models.
4. It creates a single way to understand videos for better results.
5. There are still some challenges to solve in understanding long or complex videos.

Definitions

- Novel: New and original
- Foundation: The base or starting point
- Model: A representation or example used for study or imitation
- Generative: Creating something new
- Discriminative: Distinguishing between different things
- Self-supervised: Learning without external guidance
- Pretraining: Preparing for the main task
- State-of-the-art: The best currently available
- Efficiency: Doing something well with minimal waste
- Power consumption: Amount of energy used
- Unified: Combined into one
- Spatiotemporal: Involving both space and time dimensions

Introduction

In recent years, there has been a significant increase in the use of video data for applications such as action recognition, video-language alignment, and open-world tasks. However, understanding videos at a deep level remains challenging because of their complex spatiotemporal nature. Most existing vision foundation models focus on image-level pretraining and adaptation. These models have shown promising results on many downstream tasks but are limited for dynamic and complex video-level understanding. To fill this gap, the authors (whose code is hosted under the OpenGVLab organization on GitHub) introduce InternVideo, a video foundation model that combines generative and discriminative self-supervised learning techniques. The paper approaches video representation learning by using masked video modeling and video-language contrastive learning as pretraining objectives.

The Model: InternVideo

Unlike existing vision foundation models that primarily focus on image-level pretraining and adaptation, InternVideo explores masked video modeling (the generative objective) and video-language contrastive learning (the discriminative objective) as pretraining objectives. By selectively coordinating the representations from these two frameworks in a learnable manner, the model achieves state-of-the-art performance on 39 video datasets across tasks such as action recognition/detection, video-language alignment, and open-world applications. One of the key strengths of InternVideo is its efficiency in training. The model consumes significantly less power compared to other models like CoCa while still achieving superior performance across diverse tasks. This is made possible through its unified cross-model learning approach, which leads to generalized spatiotemporal representations.
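To make the two pretraining objectives concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the mean-squared-error reconstruction target, the symmetric InfoNCE form of the contrastive loss, and the temperature of 0.07 are all assumptions chosen for illustration. The two losses are shown separately because the abstract describes them as two complementary frameworks rather than a single joint loss.

```python
import math

def masked_reconstruction_loss(target, prediction, mask):
    """Generative objective (sketch): mean squared error over masked tokens only.

    target, prediction: lists of token feature vectors (lists of floats).
    mask: list of booleans, True where the token was hidden from the encoder.
    """
    total, count = 0.0, 0
    for t, p, m in zip(target, prediction, mask):
        if m:  # visible tokens do not contribute to the loss
            total += sum((a - b) ** 2 for a, b in zip(t, p))
            count += len(t)
    return total / count if count else 0.0

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def _diag_cross_entropy(logits):
    """Average negative log-probability of the matched (diagonal) pairs."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        loss += log_z - row[i]
    return loss / len(logits)

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Discriminative objective (sketch): symmetric InfoNCE over paired embeddings.

    video_emb[i] and text_emb[i] are assumed to describe the same clip.
    """
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    sim_t = [list(col) for col in zip(*sim)]  # text-to-video direction
    return 0.5 * (_diag_cross_entropy(sim) + _diag_cross_entropy(sim_t))
```

In this toy form, the reconstruction loss is zero when every masked token is predicted exactly, and the contrastive loss is small when each video embedding is closest to its own caption embedding.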

Cross-Model Learning for Generalized Representations

The core idea behind InternVideo's success lies in its unified cross-model learning approach. Here "cross-model" refers to coordinating two models, not to mixing arbitrary input modalities: as the abstract puts it, InternVideo selectively coordinates the video representations of the masked video modeling framework and the video-language contrastive framework in a learnable manner. Because the two frameworks are complementary, combining them yields spatiotemporal representations that are not only accurate but also generalize well across different datasets.
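The abstract describes InternVideo as coordinating the representations of its two pretraining frameworks "in a learnable manner". One common way to do this kind of coordination is gated cross-attention, sketched below in plain Python. This is an illustrative assumption, not the paper's actual module: the function names, the tanh gate, and the omission of learned query/key/value projections are all simplifications made here.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(queries, keys_values):
    """Scaled dot-product attention: each query token attends to the other encoder's tokens.

    Learned projections are omitted for brevity, so both token sets must share a dimension.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)])
    return out

def coordinate(tokens_a, tokens_b, gate):
    """Add gated cross-attention output from encoder B to encoder A's tokens.

    gate is a (hypothetical) learnable scalar; tanh(0) = 0 leaves encoder A unchanged,
    so training can start from A's original representation and mix in B gradually.
    """
    attended = cross_attend(tokens_a, tokens_b)
    g = math.tanh(gate)
    return [[a + g * c for a, c in zip(ta, tc)] for ta, tc in zip(tokens_a, attended)]
```

With `gate=0.0` the output equals `tokens_a`; as the gate grows, more of the other encoder's information is added to each token.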

Performance on Zero-Shot and Open-Set Scenarios

Another impressive aspect of InternVideo is its performance in zero-shot and open-set scenarios. In these scenarios, where the model has not been trained on a particular dataset or task, it still delivers notable performance improvements compared to other models. This further highlights the robustness and generalizability of InternVideo's video representation learning capabilities.
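Zero-shot recognition with a video-language model is typically done by comparing a video embedding against text embeddings of candidate labels, with no training on the target dataset. The sketch below shows that idea in plain Python. The function names and the toy two-dimensional embeddings are hypothetical; in practice the embeddings would come from the pretrained video and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def zero_shot_classify(video_emb, label_embs):
    """Return the label whose text embedding is most similar to the video embedding.

    label_embs: dict mapping label name -> text embedding (list of floats).
    """
    return max(label_embs, key=lambda name: cosine(video_emb, label_embs[name]))

# Toy example with made-up embeddings (not produced by any real model).
labels = {"cooking": [1.0, 0.0], "running": [0.0, 1.0]}
prediction = zero_shot_classify([0.1, 0.9], labels)
```

Because the label set is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is what makes the setting "zero-shot".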

Potential Limitations and Future Work

While InternVideo excels in current popular video perception tasks using clips, it may face limitations when handling long-term or high-order video tasks like anticipating plots from movie segments. This could be an area for future work to expand the model's capabilities and address these challenges. Moreover, ethical considerations are paramount in data collection for training models like InternVideo. The team behind this research paper ensures responsible use of curated datasets by carefully vetting queries used for gathering data for ethical and legal compliance. As research progresses in the field of video understanding, it is essential to explore issues related to bias, risks, fairness, equality, and other social topics for a comprehensive impact assessment.

The Impact of InternVideo on Video Understanding Research

The introduction of InternVideo represents a significant advancement in the realm of video foundation models. Its ability to achieve superior performance across diverse tasks while maintaining efficiency in training underscores its potential for further innovation in enhancing video understanding capabilities. This research paper opens up new possibilities for future studies in the field of video representation learning and its applications.

Conclusion

In conclusion, InternVideo is a groundbreaking video foundation model that combines generative and discriminative self-supervised learning techniques to enhance video-level understanding tasks. Its unified cross-model learning approach leads to generalized spatiotemporal representations with impressive results across different datasets. While there may be limitations in handling long-term or high-order video tasks, InternVideo's success highlights the potential for further innovation in this field. As research progresses, it is crucial to consider ethical implications and address issues related to bias and fairness for responsible use of curated datasets. Overall, InternVideo represents a significant step towards enhancing video understanding capabilities and addressing broader societal implications within the field of AI research.

Created on 25 Jun. 2026
