Detect Every Thing with Few Examples

AI-generated keywords: Open-set object detection, DE-ViT, Vision-language backbones, Binary classification tasks, Region propagation technique

AI-generated Key Points

  • Open-set object detection goal: detect objects from unseen categories
  • Recent advancements focus on open-vocabulary paradigm using vision-language backbones
  • DE-ViT: new approach using vision-only DINOv2 backbones and example images for learning new categories
  • DE-ViT transforms multi-classification tasks into binary classification tasks for more efficient detection
  • Introduces novel region propagation technique for localization
  • Performance evaluation on COCO and LVIS datasets:
      • Outperforms state-of-the-art open-vocabulary method in COCO by 6.9 AP50, achieves 50 AP50 in novel classes
      • Surpasses state-of-the-art few-shot method by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and one-shot SoTA by 2.8 AP50 in COCO
      • Outperforms state-of-the-art open-vocabulary method in LVIS by 2.2 mask AP, reaches a mask APr of 34.3
  • Code available at https://github.com/mlzxy/devit
  • Ablation studies conducted to analyze different aspects of DE-ViT's performance:
      • Comparisons of different classification architectures and annotation types used to build prototypes
      • Feature visualization comparing DINOv2 visual features with CLIP text features, suggesting images may be more promising than texts for representing classes
  • Overall, DE-ViT presents a novel approach achieving superior performance compared to existing methods

Authors: Xinyu Zhang, Yuting Wang, Abdeslam Boularias

License: CC BY 4.0

Abstract: Open-set object detection aims at detecting arbitrary categories beyond those seen during training. Most recent advancements have adopted the open-vocabulary paradigm, utilizing vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmarks with COCO and LVIS. For COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 in novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot and one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at https://github.com/mlzxy/devit.
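
For readers unfamiliar with the AP50 figures quoted above: AP50 is average precision computed when a predicted box counts as correct if its intersection over union (IoU) with a ground-truth box is at least 0.5. The sketch below only illustrates that matching criterion; it is not the COCO evaluation code, and the corner-format boxes `(x1, y1, x2, y2)` and the function names are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at_iou50(pred_box, gt_box):
    """Hypothetical helper: does a prediction match a ground truth at the 0.5 IoU threshold?"""
    return iou(pred_box, gt_box) >= 0.5
```

A full AP50 computation additionally ranks predictions by confidence and integrates the precision-recall curve per class; that part is omitted here.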

Submitted to arXiv on 22 Sep. 2023


Results of the summarizing process for the arXiv paper: 2309.12969v1

In the field of open-set object detection, the goal is to detect objects that belong to categories not seen during training. Recent advancements in this area have focused on the open-vocabulary paradigm, which involves using vision-language backbones to represent categories with language. However, in this paper, the authors introduce a new approach called DE-ViT.

DE-ViT is an open-set object detector that utilizes vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve the general detection ability of DE-ViT, the authors propose transforming multi-classification tasks into binary classification tasks. This bypasses per-class inference and allows for more efficient detection. Additionally, they introduce a novel region propagation technique for localization.

The performance of DE-ViT is evaluated on various benchmarks including COCO and LVIS datasets. In terms of COCO, DE-ViT outperforms the state-of-the-art (SoTA) open-vocabulary method by 6.9 AP50 and achieves 50 AP50 in novel classes. It also surpasses the SoTA few-shot method by 15 mAP on 10-shot and 7.2 mAP on 30-shot, as well as the one-shot SoTA by 2.8 AP50.

For LVIS, DE-ViT outperforms the SoTA open-vocabulary method by 2.2 mask AP and reaches a mask APr of 34.3.

The authors provide code for DE-ViT at https://github.com/mlzxy/devit.

In addition to these results, ablation studies are conducted to analyze different aspects of DE-ViT's performance. These studies include comparisons of different classification architectures and annotation types used to build prototypes. Furthermore, feature visualization is performed to compare DINOv2 visual features with CLIP text features. The visualization shows that while CLIP visual features exhibit excellent intra-class compactness, there is a significant gap between CLIP text and visual features. This suggests that using images to represent classes may be more promising than relying solely on texts.

Overall, DE-ViT presents a novel approach to open-set object detection that achieves superior performance compared to existing methods.
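
The idea of representing classes by example images and scoring each class as its own binary decision can be illustrated with a toy sketch. This is not DE-ViT's architecture or the code in the authors' repository: the feature vectors here are plain arrays rather than DINOv2 outputs, and the mean-feature prototypes, cosine similarity, sigmoid scoring, `temperature` value, and function names are all assumptions chosen for illustration.

```python
import numpy as np


def build_prototypes(example_features):
    """Average each class's example feature vectors into one unit-norm prototype.

    example_features: dict mapping class name -> list of feature vectors
    (assumed to come from some frozen vision backbone).
    """
    prototypes = {}
    for name, feats in example_features.items():
        mean = np.asarray(feats, dtype=float).mean(axis=0)
        prototypes[name] = mean / np.linalg.norm(mean)
    return prototypes


def binary_scores(region_feature, prototypes, temperature=0.1):
    """Score one region against every class independently.

    Each class gets its own sigmoid score (a yes/no decision), so the scores
    are not normalized across classes as a softmax would be.
    """
    names = list(prototypes)
    matrix = np.stack([prototypes[n] for n in names])  # (num_classes, dim)
    f = np.asarray(region_feature, dtype=float)
    f = f / np.linalg.norm(f)
    cosine = matrix @ f  # similarity to all classes in one product
    probs = 1.0 / (1.0 + np.exp(-cosine / temperature))
    return dict(zip(names, probs.tolist()))
```

In this toy setup, adding a category only requires adding a prototype built from a few example features, and the scores of existing classes stay the same because no normalization couples them:

```python
protos = build_prototypes({
    "cat": [[1.0, 0.0], [0.9, 0.1]],
    "dog": [[0.0, 1.0], [0.1, 0.9]],
})
scores = binary_scores([1.0, 0.05], protos)
```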
Created on 10 Jan. 2024


