InternVideo: General Video Foundation Models via Generative and Discriminative Learning

AI-generated keywords: InternVideo, video foundation model, self-supervised learning, cross-model learning, ethical considerations

AI-generated Key Points

  • Introduction of a novel video foundation model, InternVideo, combining generative and discriminative self-supervised video learning techniques
  • Focus on masked video modeling and video-language contrastive learning as pretraining objectives
  • Achieves state-of-the-art performance on 39 video datasets across various tasks such as action recognition/detection, video-language alignment, and open-world applications
  • Efficiency in training with significantly less power consumption compared to other models like CoCa
  • Leveraging a unified video representation based on cross-model learning for generalized spatiotemporal representations
  • Consistently delivers notable performance improvements even in zero-shot and open-set scenarios
  • Potential limitations in handling long-term or high-order video tasks like anticipating plots from movie segments
  • Future work could focus on expanding capabilities to address these challenges and enhance generality of video representation learning
  • Emphasis on ethical considerations in data collection for training models like InternVideo, ensuring responsible use of curated datasets
  • Importance of exploring issues related to bias, risks, fairness, equality, and other social topics for comprehensive impact assessment in the field of AI research

Authors: Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao

technical report
License: CC BY 4.0

Abstract: The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
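
The video-language contrastive objective named in the abstract can be illustrated with a minimal NumPy sketch of a symmetric InfoNCE-style loss. This is an illustration under stated assumptions, not InternVideo's implementation: the function names, the temperature of 0.07, and the use of plain embedding matrices (one row per video or caption, matched by row index) are all hypothetical choices made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    # The temperature value is an assumption chosen for illustration.
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    # Video-to-text: each row should put its probability mass on the diagonal entry.
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-video: the same, taken over columns.
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

The loss is low when each video embedding is most similar to its own caption and rises when the pairs are mismatched, which is the property that pulls matched video and text representations together during pretraining.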

Submitted to arXiv on 06 Dec. 2022

Results of the summarizing process for the arXiv paper: 2212.03191v1

In this paper, we introduce InternVideo, a novel video foundation model that combines generative and discriminative self-supervised video learning to improve video-level understanding. Unlike existing vision foundation models, which focus primarily on image-level pretraining and adaptation, InternVideo uses masked video modeling and video-language contrastive learning as pretraining objectives. By selectively coordinating the representations from these two frameworks in a learnable manner, InternVideo achieves state-of-the-art performance on 39 video datasets across tasks such as action recognition/detection, video-language alignment, and open-world applications. InternVideo is also efficient to train, consuming significantly less power than other models such as CoCa. The model builds a unified video representation through cross-model learning, which yields generalized spatiotemporal representations and strong results across datasets. Even in zero-shot and open-set scenarios, InternVideo consistently delivers notable performance improvements. While InternVideo excels at currently popular video perception tasks that operate on clips, it may be limited on long-term or high-order video tasks such as anticipating plots from movie segments. Future work could extend the model to address these challenges and further improve the generality of video representation learning. Ethical considerations also matter in collecting data to train models like InternVideo: the queries used to gather data are vetted for ethical and legal compliance to ensure responsible use of the curated datasets. As research in video understanding progresses, it is essential to examine bias, risks, fairness, equality, and other social topics for a comprehensive impact assessment.
Overall, InternVideo represents a significant advance in video foundation models, achieving superior performance across diverse tasks while remaining efficient to train. Its success underscores the potential for further innovation in video understanding and for addressing the broader societal implications of AI research.
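
As a rough illustration of the masked video modeling objective mentioned in the summary, the sketch below builds a tube-style mask over video patches, where the same spatial patches are hidden in every frame and a model would be trained to reconstruct them. It assumes a VideoMAE-style masking scheme; the function name `tube_mask`, the 90% mask ratio, and the patch-grid sizes are assumptions made for the example, not values taken from this page.

```python
import numpy as np

def tube_mask(num_frames, h_patches, w_patches, mask_ratio=0.9, seed=0):
    # Returns a boolean array of shape (num_frames, h_patches, w_patches);
    # True marks a patch that is hidden from the encoder.
    rng = np.random.default_rng(seed)
    num_spatial = h_patches * w_patches
    num_masked = int(round(mask_ratio * num_spatial))
    # Pick the masked spatial positions once...
    spatial = np.zeros(num_spatial, dtype=bool)
    spatial[rng.choice(num_spatial, size=num_masked, replace=False)] = True
    # ...and repeat them across time, so each masked position forms a "tube".
    return np.tile(spatial.reshape(1, h_patches, w_patches), (num_frames, 1, 1))

# Example: 8 frames, each split into a 14 x 14 grid of patches.
mask = tube_mask(num_frames=8, h_patches=14, w_patches=14)
visible_patches_per_frame = int((~mask[0]).sum())
```

Sharing the mask across frames prevents the model from copying a hidden patch from a neighboring frame, so reconstruction has to rely on higher-level spatiotemporal structure.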
Created on 25 Jun. 2026