Auto-scaling Vision Transformers without Training
AI-generated Key Points
⚠ The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.
- The paper "Auto-scaling Vision Transformers without Training" by Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, and Denny Zhou addresses challenges in designing and scaling Vision Transformers (ViTs).
- The paper identifies two pain points: the lack of efficient, principled methods for designing and scaling ViTs, and the computational cost of training ViTs, which is much heavier than that of their convolutional counterparts.
- As-ViT is introduced as an auto-scaling framework for ViTs that automates the process of discovering and scaling up ViT architectures in a principled and efficient manner.
- As-ViT first designs a "seed" ViT topology through a training-free search. The search relies on a study of ViT network complexity, which yields a strong Kendall-tau correlation with ground-truth accuracies.
- Starting from the seed topology, the scaling rule is automated by growing the widths and depths of different ViT layers, producing a series of architectures with different numbers of parameters in a single run.
- A progressive tokenization strategy is introduced to train ViTs faster and cheaper, based on the observation that ViTs can tolerate coarse tokenization in early training stages.
- As-ViT achieves strong performance on image classification (83.5% top-1 accuracy on ImageNet-1k) and object detection (52.7% mAP on COCO) without any manual crafting or scaling of ViT architectures.
- The end-to-end model design and scaling process requires only 12 hours on a single V100 GPU.
- The work provides valuable insights into automated designing and scaling techniques for Vision Transformers without extensive manual crafting or costly training procedures.
- The code for As-ViT is openly available at https://github.com/VITA-Group/AsViT.
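
The progressive tokenization idea in the key points above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the three-stage schedule, the stride values (16, 8, 4), and the 224-pixel image size are all assumptions chosen for illustration. The sketch only shows why coarse tokenization early in training is cheaper: a larger patch stride yields fewer tokens per image.

```python
def num_tokens(image_size, patch_stride):
    """Number of tokens when a square image is cut into non-overlapping patches."""
    per_side = image_size // patch_stride
    return per_side * per_side


def stride_for_epoch(epoch, total_epochs, strides=(16, 8, 4)):
    """Hypothetical schedule: coarse stride early, finer stride later.

    Training is split into len(strides) equal stages; the stage index
    selects the stride. The values are assumptions, not from the paper.
    """
    stage = min(epoch * len(strides) // total_epochs, len(strides) - 1)
    return strides[stage]


# Illustrative run: token count per image at a few points in a 300-epoch schedule.
schedule = [
    (epoch, stride_for_epoch(epoch, 300), num_tokens(224, stride_for_epoch(epoch, 300)))
    for epoch in (0, 100, 299)
]
```

Because self-attention cost grows with the number of tokens, the early epochs in such a schedule process far fewer tokens per image than the final ones.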
Authors: Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, Denny Zhou
Abstract: This work targets automated designing and scaling of Vision Transformers (ViTs). The motivation comes from two pain spots: 1) the lack of efficient and principled methods for designing and scaling ViTs; 2) the tremendous computational cost of training ViT that is much heavier than its convolution counterpart. To tackle these issues, we propose As-ViT, an auto-scaling framework for ViTs without training, which automatically discovers and scales up ViTs in an efficient and principled manner. Specifically, we first design a "seed" ViT topology by leveraging a training-free search process. This extremely fast search is fulfilled by a comprehensive study of ViT's network complexity, yielding a strong Kendall-tau correlation with ground-truth accuracies. Second, starting from the "seed" topology, we automate the scaling rule for ViTs by growing widths/depths to different ViT layers. This results in a series of architectures with different numbers of parameters in a single run. Finally, based on the observation that ViTs can tolerate coarse tokenization in early training stages, we propose a progressive tokenization strategy to train ViTs faster and cheaper. As a unified framework, As-ViT achieves strong performance on classification (83.5% top1 on ImageNet-1k) and detection (52.7% mAP on COCO) without any manual crafting nor scaling of ViT architectures: the end-to-end model design and scaling process cost only 12 hours on one V100 GPU. Our code is available at https://github.com/VITA-Group/AsViT.
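
The abstract evaluates the training-free search by its Kendall-tau correlation with ground-truth accuracies. As a minimal sketch of what that statistic measures, the function below computes the simple tau-a variant (no tie correction) between a proxy score and trained accuracy. The function name and all numbers are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(scores, accuracies):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    if len(scores) != len(accuracies) or len(scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores)), 2):
        # Positive product: both measures rank the pair the same way.
        product = (scores[i] - scores[j]) * (accuracies[i] - accuracies[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores) * (len(scores) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values for four candidate architectures (illustration only).
proxy_scores = [0.2, 0.5, 0.9, 0.4]
trained_accuracies = [71.0, 74.5, 78.2, 75.0]
tau = kendall_tau(proxy_scores, trained_accuracies)
```

A value near 1 means the proxy ranks architectures almost the same way as full training does, which is what makes a training-free search usable as a stand-in for training each candidate.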