Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
AI-generated Key Points
⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.
- State space models (SSMs) with efficient hardware-aware designs, like Mamba, have shown promise in long sequence modeling
- Vision Mamba (Vim) is a new approach that builds efficient and generic vision backbones using SSMs
- Vim incorporates bidirectional Mamba blocks into the architecture instead of relying solely on self-attention for visual representation learning
- Vim marks image sequences with position embeddings and compresses visual representations using bidirectional state space models
- Vim achieves higher performance compared to well-established vision transformers like DeiT on tasks such as ImageNet classification, COCO object detection, and ADE20k semantic segmentation
- Vim demonstrates significantly improved computation and memory efficiency, being 2.8 times faster than DeiT and saving 86.8% GPU memory during batch inference on high-resolution images (1248x1248 resolution)
- Vim overcomes computation and memory constraints while performing Transformer-style understanding for high-resolution images
- Code for implementing Vim can be found at https://github.com/hustvl/Vim
- Vim offers a promising solution for efficient visual representation learning in computer vision tasks and has the potential to become the next-generation backbone for vision foundation models
Authors: Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang
Abstract: Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have shown great potential for long sequence modeling. Building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance of visual representation learning on self-attention is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8$\times$ faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248$\times$1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to become the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.
Ask questions about this paper to our AI assistant
You can also chat with multiple papers at once here.
⚠The license of the paper does not allow us to build upon its content and the AI assistant only knows about the paper metadata rather than the full article.
Assess the quality of the AI-generated content by voting
Score: 0
Why do we need votes?
Votes are used to determine whether we need to re-run our summarizing tools. If the count reaches -10, our tools can be restarted.
The previous summary was created more than a year ago and can be re-run (if necessary) by clicking on the Run button below.
⚠The license of this specific paper does not allow us to build upon its content and the summarizing tools will be run using the paper metadata rather than the full article. However, it still does a good job, and you can also try our tools on papers with more open licenses.
Similar papers summarized with our AI tools
Navigate through even more similar papers through a
tree representationLook for similar papers (in beta version)
By clicking on the button above, our algorithm will scan all papers in our database to find the closest based on the contents of the full papers and not just on metadata. Please note that it only works for papers that we have generated summaries for and you can rerun it from time to time to get a more accurate result while our database grows.
Disclaimer: The AI-based summarization tool and virtual assistant provided on this website may not always provide accurate and complete summaries or responses. We encourage you to carefully review and evaluate the generated content to ensure its quality and relevance to your needs.