PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture

AI-generated keywords: Computer Vision, Transformer Networks, Transformer-in-Transformer, PyramidTNT, Convolutional Stem

AI-generated Key Points

The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • Transformer networks have made significant strides in computer vision tasks.
  • The Transformer-in-Transformer (TNT) architecture is notable for its ability to extract both local and global representations using inner and outer transformers.
  • Researchers including Kai Han, Jianyuan Guo, Yehui Tang, and Yunhe Wang have introduced new TNT baselines with a pyramid architecture and a convolutional stem.
  • The "PyramidTNT" model is an improvement over the original TNT, establishing hierarchical representations and outperforming leading vision transformers like Swin Transformer.
  • The code for PyramidTNT will be available at https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch.
  • This work extends the "Transformer in Transformer" concept and highlights the ongoing evolution of transformer-based approaches in computer vision.

Authors: Kai Han, Jianyuan Guo, Yehui Tang, Yunhe Wang

Tech Report. An extension of "Transformer in Transformer" (arXiv:2103.00112)

Abstract: Transformer networks have achieved great progress for computer vision tasks. The Transformer-in-Transformer (TNT) architecture uses an inner transformer and an outer transformer to extract both local and global representations. In this work, we present new TNT baselines by introducing two advanced designs: 1) pyramid architecture, and 2) convolutional stem. The new "PyramidTNT" significantly improves the original TNT by establishing hierarchical representations. PyramidTNT achieves better performance than previous state-of-the-art vision transformers such as Swin Transformer. We hope this new baseline will be helpful for further research on and application of vision transformers. Code will be available at https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch.

Submitted to arXiv on 04 Jan. 2022

Results of the summarizing process for the arXiv paper: 2201.00978v1

This paper's license does not allow us to build upon its content, so the summary below was generated from the paper's metadata rather than the full article.

Transformer networks have made significant progress on computer vision tasks. The Transformer-in-Transformer (TNT) architecture uses an inner transformer and an outer transformer to extract both local and global representations. Building on this foundation, Kai Han, Jianyuan Guo, Yehui Tang, and Yunhe Wang introduce new TNT baselines with two design changes: a pyramid architecture and a convolutional stem. According to the abstract, the resulting "PyramidTNT" model significantly improves on the original TNT by establishing hierarchical representations, and it achieves better performance than previous state-of-the-art vision transformers such as Swin Transformer. The authors hope the new baseline will be helpful for further research on and application of vision transformers. They state that code will be available at https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch. The work is a tech report that extends "Transformer in Transformer" (arXiv:2103.00112).
Created on 30 Dec. 2025
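
The metadata names a pyramid architecture and a convolutional stem but gives no configuration. As a rough illustration of what "hierarchical representations" usually means in pyramid vision backbones, the sketch below computes the token-grid size at each stage. Every number in it is an assumption made for illustration and is not taken from the paper: a stem that downsamples by 4, four stages, and a 2x spatial reduction between stages. The names `StageShape` and `pyramid_stage_shapes` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageShape:
    """Token-grid size at one stage of a hypothetical pyramid backbone."""
    stage: int
    height: int
    width: int

    @property
    def num_tokens(self) -> int:
        return self.height * self.width


def pyramid_stage_shapes(
    image_size: int,
    stem_stride: int = 4,   # assumed stem downsampling, not from the paper
    num_stages: int = 4,    # assumed stage count, not from the paper
    downsample: int = 2,    # assumed reduction between stages
) -> List[StageShape]:
    """Return the token grid of each stage for a square input image."""
    total_stride = stem_stride * downsample ** (num_stages - 1)
    if image_size % total_stride != 0:
        raise ValueError(
            f"image_size must be divisible by {total_stride}, got {image_size}"
        )
    shapes = []
    side = image_size // stem_stride  # grid produced by the stem
    for stage in range(1, num_stages + 1):
        shapes.append(StageShape(stage, side, side))
        side //= downsample  # each later stage works on a coarser grid
    return shapes


if __name__ == "__main__":
    for s in pyramid_stage_shapes(224):
        print(f"stage {s.stage}: {s.height}x{s.width} grid, {s.num_tokens} tokens")
```

Under these assumed settings a 224x224 input yields grids of 56, 28, 14, and 7 tokens per side. A non-pyramid design would instead keep a single grid size through all layers; the shrinking grid is what gives later stages a coarser, more global view.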
