Global Context Vision Transformers

AI-generated keywords: GC ViT, Self-Attention, Inverted Residual Block, Parameter Efficiency, Computer Vision

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The paper proposes a novel architecture called GC ViT for computer vision tasks
  • GC ViT enhances parameter and compute utilization
  • The model consists of global context self-attention modules combined with standard local self-attention to effectively model both long and short-range spatial interactions
  • The proposed downsampler leverages a parameter-efficient fused inverted residual block to improve the modeling of inter-channel dependencies
  • GC ViT achieves new state-of-the-art performance across image classification, object detection, and semantic segmentation tasks
  • Pre-trained GC ViT backbones consistently outperform prior work, sometimes by large margins, in downstream object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets
  • The results are attributed to better parameter efficiency combined with high accuracy across computer vision tasks

Authors: Ali Hatamizadeh, Hongxu Yin, Jan Kautz, Pavlo Molchanov

Tech report
License: CC BY-NC-ND 4.0

Abstract: We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision tasks. At the core of the model are global context self-attention modules, combined with standard local self-attention, to effectively yet efficiently model both long- and short-range spatial interactions, as an alternative to complex operations such as attention masks or local window shifting. While the local self-attention modules are responsible for modeling short-range information, the global query tokens are shared across all global self-attention modules to interact with local keys and values. In addition, we address the lack of inductive bias in ViTs and improve the modeling of inter-channel dependencies by proposing a novel downsampler which leverages a parameter-efficient fused inverted residual block. The proposed GC ViT achieves new state-of-the-art performance across image classification, object detection and semantic segmentation tasks. On the ImageNet-1K dataset for classification, GC ViT models with 51M, 90M and 201M parameters achieve 84.3%, 84.9% and 85.6% Top-1 accuracy, respectively, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer. Pre-trained GC ViT backbones in downstream tasks of object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets outperform prior work consistently, sometimes by large margins.
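
The interplay described in the abstract, where shared global query tokens attend over local keys and values, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 1D token sequence instead of a 2D feature map, a single attention head, and no learned projections. The helper names `split_windows`, `pooled_global_queries`, and `global_context_attention` are hypothetical, and averaging tokens across windows is only a stand-in for whatever the paper uses to produce its global queries.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def split_windows(tokens, window):
    """Cut the sequence into non-overlapping windows (hypothetical helper)."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

def pooled_global_queries(windows):
    """Assumed stand-in for a global query generator: average the token at
    each window position across all windows, giving window-sized queries
    that summarize the whole sequence."""
    size, dim = len(windows[0]), len(windows[0][0])
    return [[sum(w[i][j] for w in windows) / len(windows) for j in range(dim)]
            for i in range(size)]

def global_context_attention(windows, global_queries):
    """The same global queries are reused for every window, while keys and
    values stay local to that window."""
    return [attend(global_queries, w, w) for w in windows]

# Toy input: 8 tokens of dimension 2, split into two windows of 4 tokens.
tokens = [[float(i), 1.0] for i in range(8)]
windows = split_windows(tokens, 4)

# Local self-attention: queries, keys, and values all come from one window.
local_out = [attend(w, w, w) for w in windows]

# Global context attention: shared queries, window-local keys and values.
global_out = global_context_attention(windows, pooled_global_queries(windows))
print(len(global_out), len(global_out[0]), len(global_out[0][0]))
```

Because the global queries have the same length as a window, the global branch returns outputs of the same shape as the local branch, so the two kinds of attention could be alternated layer by layer. No masking or window shifting is needed to pass information between windows in this sketch, since that information enters through the shared queries.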

Submitted to arXiv on 20 Jun. 2022

Results of the summarizing process for the arXiv paper: 2206.09959v4

In their paper titled "Global Context Vision Transformers," Ali Hatamizadeh, Hongxu Yin, Jan Kautz, and Pavlo Molchanov propose GC ViT, an architecture that improves parameter and compute utilization for computer vision tasks. The core of the model consists of global context self-attention modules combined with standard local self-attention, which together model both long- and short-range spatial interactions. This approach is an alternative to more complex operations such as attention masks or local window shifting. The local self-attention modules model short-range information, while the global query tokens are shared across all global self-attention modules and interact with local keys and values.

The authors also address the lack of inductive bias in ViTs by proposing a downsampler built on a parameter-efficient fused inverted residual block, which also improves the modeling of inter-channel dependencies.

According to the abstract, GC ViT achieves new state-of-the-art performance across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K dataset for classification, GC ViT models with 51M, 90M, and 201M parameters achieve 84.3%, 84.9%, and 85.6% Top-1 accuracy, respectively, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer. Pre-trained GC ViT backbones also outperform prior work consistently, sometimes by large margins, in downstream object detection, instance segmentation, and semantic segmentation on the MS COCO and ADE20K datasets.

These results are attributed to better parameter efficiency combined with high accuracy across computer vision tasks. Overall, the work points to shared global query tokens paired with local attention as one way to improve transformer-based models in computer vision.
Created on 09 Apr. 2023
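
To put the reported ImageNet-1K figures in perspective, the short calculation below derives how much Top-1 accuracy each extra million parameters buys between consecutive model sizes. It uses only the three (parameter count, accuracy) pairs quoted above. The function name `marginal_gains` is an assumption made for this example, and no ConvNeXt or Swin Transformer figures are included because the metadata does not give them.

```python
# (parameters in millions, ImageNet-1K Top-1 accuracy in %), as quoted above.
reported = [(51, 84.3), (90, 84.9), (201, 85.6)]

def marginal_gains(points):
    """Top-1 points gained per extra million parameters between
    consecutive model sizes (hypothetical helper for this example)."""
    gains = []
    for (p0, a0), (p1, a1) in zip(points, points[1:]):
        gains.append((a1 - a0) / (p1 - p0))
    return gains

for (p0, _), (p1, _), gain in zip(reported, reported[1:], marginal_gains(reported)):
    print(f"{p0}M -> {p1}M: {gain:.4f} Top-1 points per extra million parameters")
```

The per-parameter gain shrinks as the models grow, which is the usual pattern of diminishing returns and one reason the comparison is made against comparably sized models.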
