PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture

AI-generated keywords: Computer Vision, Transformer Networks, Transformer-in-Transformer, PyramidTNT, Convolutional Stem

AI-generated Key Points

⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

Transformer networks have made significant strides in computer vision tasks.
The Transformer-in-Transformer (TNT) architecture is notable for its ability to extract both local and global representations using inner and outer transformers.
Researchers including Kai Han, Jianyuan Guo, Yehui Tang, and Yunhe Wang have introduced new TNT baselines with a pyramid architecture and a convolutional stem.
The "PyramidTNT" model is an improvement over the original TNT, establishing hierarchical representations and outperforming leading vision transformers like Swin Transformer.
The code for PyramidTNT will be available at https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch.
This work extends the "Transformer in Transformer" concept and highlights the ongoing evolution of transformer-based approaches in computer vision.

Authors: Kai Han, Jianyuan Guo, Yehui Tang, Yunhe Wang

arXiv: 2201.00978v1 (cs.CV)

Tech Report. An extension of "Transformer in Transformer" (arXiv:2103.00112)

License: NONEXCLUSIVE-DISTRIB 1.0

Abstract: Transformer networks have achieved great progress for computer vision tasks. Transformer-in-Transformer (TNT) architecture utilizes inner transformer and outer transformer to extract both local and global representations. In this work, we present new TNT baselines by introducing two advanced designs: 1) pyramid architecture, and 2) convolutional stem. The new "PyramidTNT" significantly improves the original TNT by establishing hierarchical representations. PyramidTNT achieves better performances than the previous state-of-the-art vision transformers such as Swin Transformer. We hope this new baseline will be helpful to the further research and application of vision transformer. Code will be available at https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch.

Submitted to arXiv on 04 Jan. 2022

Results of the summarizing process for the arXiv paper: 2201.00978v1

Comprehensive Summary

Transformer networks have made significant strides in computer vision. The Transformer-in-Transformer (TNT) architecture uses an inner transformer and an outer transformer to extract both local and global representations. Building on this foundation, Kai Han, Jianyuan Guo, Yehui Tang, and Yunhe Wang introduce new TNT baselines with two designs: a pyramid architecture and a convolutional stem. The resulting "PyramidTNT" model substantially improves on the original TNT by establishing hierarchical representations, and the authors report that it outperforms previous state-of-the-art vision transformers such as Swin Transformer. They hope the new baseline will help further research on, and application of, vision transformers. The code is to be made available at https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch. This work is an extension of "Transformer in Transformer" (arXiv:2103.00112).
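To make the local/global split concrete, here is a minimal sketch of the token bookkeeping behind a two-level design of this kind: the image is cut into patches (tokens for the outer transformer), and each patch is cut again into sub-patches (tokens for the inner transformer). The function name `tnt_token_layout` and the sizes 224, 16 and 4 are illustrative assumptions, not values taken from this page or confirmed as the authors' configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenLayout:
    outer_tokens: int            # patch-level tokens (global, outer transformer)
    inner_tokens_per_patch: int  # sub-patch tokens inside one patch (local, inner transformer)


def tnt_token_layout(image_size: int, patch_size: int, sub_patch_size: int) -> TokenLayout:
    """Count tokens at both levels for a square image (hypothetical helper)."""
    if image_size % patch_size or patch_size % sub_patch_size:
        raise ValueError("image_size must be divisible by patch_size, "
                         "and patch_size by sub_patch_size")
    patches_per_side = image_size // patch_size
    sub_patches_per_side = patch_size // sub_patch_size
    return TokenLayout(
        outer_tokens=patches_per_side ** 2,
        inner_tokens_per_patch=sub_patches_per_side ** 2,
    )


# Assumed sizes: 224x224 image, 16x16 patches, 4x4 sub-patches.
layout = tnt_token_layout(224, 16, 4)
print(layout)  # TokenLayout(outer_tokens=196, inner_tokens_per_patch=16)
```

With these assumed sizes the outer transformer attends over 196 patch tokens, while the inner transformer attends over 16 sub-patch tokens within each patch.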

Layman's Summary

Transformer networks are advanced tools that help computers see better. The Transformer-in-Transformer (TNT) design is special because it can understand both small and big details using inner and outer transformers. Some smart people named Kai Han, Jianyuan Guo, Yehui Tang, and Yunhe Wang have created new TNT models with a pyramid shape and a special starting point called a convolutional stem. One of these models, called "PyramidTNT," is even better than the original TNT and beats other top transformer models like Swin Transformer in computer vision tasks. You can find the code for PyramidTNT on a website to try it out yourself.

Definitions

- Transformer networks: Advanced tools that help computers process visual information.
- Architecture: The way different parts of something are organized or designed.
- Representation: A way to show or describe something.
- Baselines: Starting points or reference levels used for comparison.
- Hierarchical: Arranged in levels like a pyramid, with some parts being more important than others.
- GitHub: A website where people share and collaborate on coding projects.

Blog article

In recent years, transformer networks have emerged as a powerful tool in computer vision. Originally designed for natural language processing, they have shown great potential for visual data and have achieved state-of-the-art results on various computer vision tasks. Among these architectures, Transformer-in-Transformer (TNT) stands out for extracting both local and global representations using inner and outer transformers. Kai Han, Jianyuan Guo, Yehui Tang, and Yunhe Wang have introduced new TNT baselines featuring two designs: a pyramid architecture and a convolutional stem. The work is an extension of "Transformer in Transformer" and reflects the ongoing evolution of transformer-based approaches in computer vision.

The original TNT model was proposed by Han et al. in the paper "Transformer in Transformer" (arXiv:2103.00112). It uses an inner transformer to capture local features within each image patch and an outer transformer to capture global information across patches. The original design, however, does not build hierarchical representations.

To address this, the team proposed PyramidTNT, which arranges the inner and outer transformers in a pyramid structure. The resulting model establishes hierarchical representations by capturing features at different scales within an image.

The researchers also introduced a convolutional stem, which uses convolutional layers at the input in place of a single linear patch projection. According to the abstract, these two designs together significantly improve on the original TNT.
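The pyramid structure and convolutional stem described above can be illustrated as shape bookkeeping. The sketch below assumes a hypothetical stem of three 3x3 convolutions and a pyramid in which each stage halves the spatial side and doubles the channel width, a pattern common to hierarchical backbones. The layer list, the base width of 64 and the four stages are assumptions for illustration only, not the configuration reported by the authors.

```python
from typing import List, Tuple

ConvSpec = Tuple[int, int, int]   # (kernel_size, stride, padding)
Stage = Tuple[int, int, int]      # (side_length, num_tokens, channels)


def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a convolution without dilation."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_output_size(image_size: int, stem: List[ConvSpec]) -> int:
    """Apply each convolution of the stem in turn to a square input."""
    size = image_size
    for kernel, stride, padding in stem:
        size = conv_output_size(size, kernel, stride, padding)
    return size


def pyramid_stages(stem_size: int, base_dim: int, num_stages: int) -> List[Stage]:
    """Assumed pyramid: halve the side and double the channels at each stage."""
    stages: List[Stage] = []
    size, dim = stem_size, base_dim
    for _ in range(num_stages):
        stages.append((size, size * size, dim))
        size //= 2
        dim *= 2
    return stages


# Hypothetical stem: three 3x3 convolutions with strides 2, 1 and 2.
HYPOTHETICAL_STEM: List[ConvSpec] = [(3, 2, 1), (3, 1, 1), (3, 2, 1)]

stem_size = stem_output_size(224, HYPOTHETICAL_STEM)
for index, (side, tokens, dim) in enumerate(pyramid_stages(stem_size, 64, 4), start=1):
    print(f"stage {index}: {side}x{side} grid, {tokens} tokens, {dim} channels")
```

Under these assumptions a 224x224 image leaves the stem as a 56x56 grid, and the four stages see 3136, 784, 196 and 49 tokens respectively, so later stages work on coarser, wider feature maps.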
According to the abstract, PyramidTNT achieves better performance than previous state-of-the-art vision transformers such as Swin Transformer. The abstract does not name the benchmark datasets or the other baselines used in the comparison; those details are in the full paper.

The team plans to release the code for PyramidTNT on GitHub (https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch), making it accessible for other researchers to use and build upon.

In conclusion, PyramidTNT is a significant improvement over the original TNT model and shows the continuing evolution of transformer-based architectures in computer vision. The authors hope that this new baseline, with its hierarchical representations, will be helpful to further research and application of vision transformers.

Created on 30 Dec. 2025
