F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

AI-generated keywords: F-VLM, Object Detection, Frozen Vision and Language Models, Open-Vocabulary, Knowledge Distillation

AI-generated Key Points

  • F-VLM is an open-vocabulary object detection method built on frozen Vision and Language Models (VLMs)
  • It simplifies the usual multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining
  • Only the detector head is fine-tuned; the detector and VLM outputs are combined for each region at inference time
  • It achieves a +6.5 mask Average Precision (AP) improvement over the previous state of the art on novel categories of the LVIS open-vocabulary detection benchmark
  • It shows competitive results on the COCO open-vocabulary detection benchmark and on cross-dataset transfer detection
  • Ablation studies cover backbone fine-tuning, score fusion design and parameters, feature pyramid capacity, and background weight
  • Backbone fine-tuning improves standard detection on base categories but slightly hurts open-vocabulary detection on novel categories
  • In score fusion, the geometric mean outperforms the arithmetic mean (+8 APr)
  • F-VLM offers a promising approach to object detection by leveraging frozen VLMs effectively

Authors: Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova

Accepted to ICLR 2023 (https://iclr.cc/Conferences/2023). 20 pages, 7 figures
License: CC BY 4.0

Abstract: We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves +6.5 mask AP improvement over the previous state of the art on novel categories of LVIS open-vocabulary detection benchmark. In addition, we demonstrate very competitive results on COCO open-vocabulary detection benchmark and cross-dataset transfer detection, in addition to significant training speed-up and compute savings. Code will be released at https://sites.google.com/view/f-vlm/home
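
The inference step described in the abstract (a frozen VLM acting as a region classifier, with its output combined with the detector's for each region) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the toy embeddings, the `temperature` and `alpha` parameters, and the helper names are hypothetical, and the sketch assumes the VLM scores a region by cosine similarity between a pooled region embedding and each category's text embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def vlm_region_scores(region_embedding, text_embeddings, temperature=0.1):
    """Score one region against every category text embedding.

    `temperature` is a hypothetical scaling parameter, not a value from the paper.
    """
    sims = [cosine(region_embedding, t) for t in text_embeddings]
    return softmax([s / temperature for s in sims])

def fuse(det_scores, vlm_scores, alpha=0.5):
    """Weighted geometric mean of detector and VLM scores per category.

    `alpha` (the weight on the VLM score) is an illustrative assumption.
    """
    return [d ** (1 - alpha) * v ** alpha for d, v in zip(det_scores, vlm_scores)]

# Toy data: three categories with made-up, axis-aligned text embeddings.
categories = ["cat", "dog", "zebra"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# One region: a pooled embedding from the frozen backbone (made up here)
# and class scores from the trained detector head (also made up).
region_embedding = [0.1, 0.2, 0.9]
det_scores = [0.2, 0.3, 0.5]

vlm_scores = vlm_region_scores(region_embedding, text_embeddings)
fused = fuse(det_scores, vlm_scores)
print(categories[fused.index(max(fused))])  # prints "zebra"
```

Because the text embeddings are computed from category names, the category list can in principle be swapped at inference time without retraining, which is the sense in which the classifier is open-vocabulary.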

Submitted to arXiv on 30 Sep. 2022


Results of the summarizing process for the arXiv paper: 2209.15639v2

In their paper titled "F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models," Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova introduce F-VLM, an object detection method built on frozen Vision and Language Models (VLMs). The approach simplifies the traditional multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. The authors observe that a frozen VLM retains the locality-sensitive features required for detection and serves as a strong region classifier.

The key idea of F-VLM is to fine-tune only the detector head and to combine the outputs of the detector and the VLM for each region during inference. This approach shows compelling scaling behavior and achieves a +6.5 mask Average Precision (AP) improvement over the previous state of the art on novel categories of the LVIS open-vocabulary detection benchmark. Competitive results are also reported on the COCO open-vocabulary detection benchmark and on cross-dataset transfer detection tasks.

The authors examine their design through ablation studies on backbone fine-tuning, score fusion design and parameters, feature pyramid capacity, and background weight. Notably, they find that while fine-tuning the backbone improves standard detection for base categories, it slightly hampers open-vocabulary detection for novel categories. The study also shows that the geometric mean works better than the arithmetic mean in score fusion (+8 APr), since the geometric mean favors regions that score highly under both the detector and the VLM at the same time.

Overall, F-VLM presents a promising approach to object detection by leveraging frozen VLMs effectively. The findings contribute to open-vocabulary object detection and suggest strategies for further improving performance in this domain.
The authors plan to release their code for broader research community access at https://sites.google.com/view/f-vlm/home.
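
The preference for the geometric mean over the arithmetic mean in score fusion can be made concrete with a small numeric comparison. This is an illustrative sketch under stated assumptions, not the paper's code: the scores are invented, and the weight `alpha` and both function names are hypothetical.

```python
def arithmetic_fusion(det, vlm, alpha=0.5):
    """Weighted arithmetic mean of a detector score and a VLM score."""
    return (1 - alpha) * det + alpha * vlm

def geometric_fusion(det, vlm, alpha=0.5):
    """Weighted geometric mean of a detector score and a VLM score."""
    return det ** (1 - alpha) * vlm ** alpha

# Invented (detector score, VLM score) pairs for three kinds of region.
regions = {
    "both confident": (0.80, 0.80),
    "detector only": (0.95, 0.05),
    "VLM only": (0.05, 0.95),
}

for name, (det, vlm) in regions.items():
    a = arithmetic_fusion(det, vlm)
    g = geometric_fusion(det, vlm)
    print(f"{name:15s} arithmetic={a:.3f} geometric={g:.3f}")
```

With equal weights, the arithmetic mean still gives 0.5 to a region that only one of the two models supports, while the geometric mean drops it to roughly 0.22. A region that both models support keeps its score of 0.8 under either rule, so the geometric mean ranks agreed-upon regions further ahead of one-sided ones.
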
Created on 13 Mar. 2024

