ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

AI-generated keywords: Visual recognition, ConvNeXt, self-supervised learning, masked autoencoders, architectural enhancements

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Significant advancements in visual recognition in the early 2020s driven by improved architectures and representation learning frameworks
  • Emergence of modern ConvNets like ConvNeXt demonstrating impressive performance across various scenarios
  • Potential for enhancement through self-supervised learning techniques like masked autoencoders (MAE)
  • Introduction of a fully convolutional masked autoencoder framework and a Global Response Normalization (GRN) layer, which address the subpar performance of naively combining ConvNeXt with MAE and enhance inter-channel feature competition
  • Development of ConvNeXt V2 model family by integrating self-supervised learning techniques with architectural enhancements
  • Improved performance on recognition benchmarks including ImageNet classification, COCO detection, and ADE20K segmentation
  • Availability of pre-trained models in different sizes, ranging from an efficient 3.7M-parameter Atto model (76.7% top-1 accuracy on ImageNet) to a 650M-parameter Huge model that reaches a state-of-the-art 88.9% accuracy using only publicly available training data
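
The key points name the GRN layer but do not say how it computes its output. The sketch below is one plausible NumPy rendering of the three steps usually associated with GRN (per-channel global aggregation, divisive normalization across channels, then a learnable calibration with a residual connection). The function name `global_response_norm`, the channels-last layout, and the `eps` value are assumptions made for illustration; the linked repository remains the authoritative implementation.

```python
import numpy as np


def global_response_norm(x, gamma, beta, eps=1e-6):
    """Hypothetical GRN sketch.

    x:     array of shape (N, H, W, C), channels last (assumed layout)
    gamma: learnable per-channel scale, shape (C,)
    beta:  learnable per-channel shift, shape (C,)
    """
    # 1) Global aggregation: L2 norm of each channel over the spatial dims.
    gx = np.sqrt(np.sum(x * x, axis=(1, 2), keepdims=True))  # (N, 1, 1, C)
    # 2) Divisive normalization: each channel's norm relative to the
    #    average norm across channels, so channels compete with each other.
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)         # (N, 1, 1, C)
    # 3) Calibration: rescale the input, then add a residual connection.
    return gamma * (x * nx) + beta + x


# Usage: with gamma and beta initialised to zero the layer is the identity.
x = np.random.default_rng(0).normal(size=(2, 4, 4, 8))
y = global_response_norm(x, np.zeros(8), np.zeros(8))
assert np.allclose(y, x)
```

Because step 2 divides by the mean response over channels, a channel with an above-average norm is amplified and one with a below-average norm is damped, which is one way to read "inter-channel feature competition".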

Authors: Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie

Code and models available at https://github.com/facebookresearch/ConvNeXt-V2
License: CC BY-NC-ND 4.0

Abstract: Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.

Submitted to arXiv on 02 Jan. 2023


Results of the summarizing process for the arXiv paper: 2301.00808v1


Visual recognition advanced rapidly in the early 2020s, driven by improved architectures and better representation learning frameworks. Modern ConvNets, exemplified by ConvNeXt, have demonstrated strong performance across various scenarios. These models were originally designed for supervised learning with ImageNet labels, but they can potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). The authors found, however, that simply combining the two approaches leads to subpar performance. To address this, they propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer, which can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning and architectural improvement results in a new model family, ConvNeXt V2, which significantly improves the performance of pure ConvNets on recognition benchmarks including ImageNet classification, COCO detection, and ADE20K segmentation. The authors also provide pre-trained models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet to a 650M-parameter Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.
In conclusion, "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders" shows how co-designing a self-supervised learning technique with architectural improvements can yield substantial gains in image classification, object detection, and semantic segmentation.
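
The summary mentions masked autoencoder pre-training without describing the masking itself. As a rough illustration of the general idea, the sketch below hides a random subset of image patches and expands the patch-level mask to pixel resolution. The function name `random_patch_mask` and the defaults (`patch_size=32`, `mask_ratio=0.6`, a 224x224 input) are illustrative assumptions and are not taken from this page; it is not the authors' implementation.

```python
import numpy as np


def random_patch_mask(height, width, patch_size=32, mask_ratio=0.6, seed=0):
    """Hypothetical helper: choose which patches to hide (True = masked)."""
    rng = np.random.default_rng(seed)
    grid_h, grid_w = height // patch_size, width // patch_size
    num_patches = grid_h * grid_w
    num_masked = int(round(mask_ratio * num_patches))

    # Sample the masked patch indices without replacement.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    patch_mask = np.zeros(num_patches, dtype=bool)
    patch_mask[masked_idx] = True
    patch_mask = patch_mask.reshape(grid_h, grid_w)

    # Expand every patch entry to a patch_size x patch_size block of pixels.
    pixel_mask = patch_mask.repeat(patch_size, axis=0).repeat(patch_size, axis=1)
    return patch_mask, pixel_mask


# Usage: zero out the masked pixels of a dummy channels-last image.
image = np.ones((224, 224, 3), dtype=np.float32)
patch_mask, pixel_mask = random_patch_mask(224, 224)
masked_image = image * ~pixel_mask[..., None]
```

In a masked autoencoder, the encoder would see only the visible content of `masked_image`, and a decoder would be trained to reconstruct the hidden patches.
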
Created on 16 Nov. 2025

