ViTPose++: Vision Transformer for Generic Body Pose Estimation
AI-generated Key Points
⚠ The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.
- Authors Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao explore the capabilities of plain vision transformers in body pose estimation
- ViTPose is a simple baseline that shows four properties of plain vision transformers: simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models
- ViTPose uses a plain, non-hierarchical vision transformer as the encoder and a lightweight decoder that decodes body keypoints in either a top-down or a bottom-up manner
- The model scales from about 20M to 1B parameters and sets a new Pareto front for throughput and performance
- ViTPose demonstrates flexibility in attention type, input resolution, pre-training strategies, and fine-tuning methods
- The ViTPose+ extension handles heterogeneous body keypoint categories across different pose estimation tasks through knowledge factorization, using both task-agnostic and task-specific feed-forward networks in the transformer
- The authors show empirically that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token
- Experiments show ViTPose outperforming representative methods on the challenging MS COCO Human Keypoint Detection benchmark in both top-down and bottom-up settings
- ViTPose+ achieves state-of-the-art results across various body pose estimation tasks without compromising on inference speed
Authors: Yufei Xu, Jing Zhang, Qiming Zhang, Dacheng Tao
Abstract: In this paper, we show the surprisingly good properties of plain vision transformers for body pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model dubbed ViTPose. Specifically, ViTPose employs the plain and non-hierarchical vision transformer as an encoder to encode features and a lightweight decoder to decode body keypoints in either a top-down or a bottom-up manner. It can be scaled up from about 20M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of the vision transformer, setting a new Pareto front for throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, and pre-training and fine-tuning strategy. Based on the flexibility, a novel ViTPose+ model is proposed to deal with heterogeneous body keypoint categories in different types of body pose estimation tasks via knowledge factorization, i.e., adopting task-agnostic and task-specific feed-forward networks in the transformer. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our ViTPose model outperforms representative methods on the challenging MS COCO Human Keypoint Detection benchmark at both top-down and bottom-up settings. Furthermore, our ViTPose+ model achieves state-of-the-art performance simultaneously on a series of body pose estimation tasks, including MS COCO, AI Challenger, OCHuman, MPII for human keypoint detection, COCO-Wholebody for whole-body keypoint detection, as well as AP-10K and APT-36K for animal keypoint detection, without sacrificing inference speed.
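The knowledge factorization idea mentioned in the abstract (task-agnostic plus task-specific feed-forward networks) can be illustrated with a small sketch. This is not the authors' implementation: the class name `FactorizedFFN`, the task names, the use of ReLU, and the choice to concatenate the shared and task-specific outputs along the channel dimension are all assumptions made for illustration, since the abstract does not say how the two branches are combined.

```python
import random


def init_linear(rng, in_dim, out_dim):
    """Return (weight, bias) for a small randomly initialised linear layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Compute y = W x + b for a single token vector x (a list of floats)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    # ReLU is used only for brevity; the activation is an assumption.
    return [max(0.0, v) for v in x]


class FactorizedFFN:
    """Hypothetical FFN with one shared branch and one branch per task."""

    def __init__(self, dim, hidden, shared_out, tasks, seed=0):
        rng = random.Random(seed)
        specific_out = dim - shared_out  # remaining channels are task-specific
        # Task-agnostic branch: the same weights serve every task.
        self.shared = (init_linear(rng, dim, hidden),
                       init_linear(rng, hidden, shared_out))
        # Task-specific branches: separate weights per task name.
        self.specific = {
            task: (init_linear(rng, dim, hidden),
                   init_linear(rng, hidden, specific_out))
            for task in tasks
        }

    @staticmethod
    def _mlp(params, x):
        (w1, b1), (w2, b2) = params
        return linear(relu(linear(x, w1, b1)), w2, b2)

    def __call__(self, x, task):
        # Assumed combination rule: concatenate shared and task-specific outputs.
        return self._mlp(self.shared, x) + self._mlp(self.specific[task], x)


ffn = FactorizedFFN(dim=8, hidden=16, shared_out=6, tasks=["human", "animal"])
token = [0.5] * 8
human_out = ffn(token, "human")
animal_out = ffn(token, "animal")
```

In this sketch the first six output channels are identical for both tasks because they come from the shared branch, while the last two depend on which task-specific branch is selected. Only one task-specific branch runs per input, which is one plausible reading of why such a design need not slow down inference.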