Describe Anything: Detailed Localized Image and Video Captioning

AI-generated keywords: Vision-Language Models

AI-generated Key Points

  • The Describe Anything Model (DAM) was introduced for Detailed Localized Captioning (DLC), balancing local details with global context in image and video captions.
  • DLC-SDP, a Semi-supervised Learning-based Data Pipeline, improved DLC data quality by leveraging segmentation datasets and unlabeled web images.
  • DLC-Bench was introduced as a benchmark for evaluating DLC without relying on reference captions, using an attribute-based evaluation approach.
  • DAM achieved state-of-the-art performance on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
  • Off-the-shelf Vision-Language Models (VLMs) like GPT-4o and LLaVA excel at global-level descriptions but struggle with detailed localized captions due to limitations in specifying regions accurately.
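
The attribute-based evaluation mentioned in the key points can be illustrated with a toy sketch. The source does not spell out the scoring rule, so everything here is an assumption: the function name `attribute_score`, the split into expected ("positive") and incorrect ("negative") attributes, and the plain substring matching, which stands in for whatever judge the benchmark actually uses. The only point is that a caption can be scored against attributes of the region rather than against a reference caption.

```python
from typing import Sequence


def attribute_score(
    caption: str,
    positive: Sequence[str],
    negative: Sequence[str],
) -> float:
    """Score a caption against region attributes (hypothetical rule).

    One point for each expected attribute the caption mentions and one
    point for each incorrect attribute it avoids, normalised to [0, 1].
    Substring matching is a stand-in for a real judge.
    """
    total = len(positive) + len(negative)
    if total == 0:
        raise ValueError("at least one attribute is required")

    text = caption.lower()
    mentioned = sum(1 for attr in positive if attr.lower() in text)
    avoided = sum(1 for attr in negative if attr.lower() not in text)
    return (mentioned + avoided) / total


if __name__ == "__main__":
    caption = "A red ceramic mug with a chipped handle"
    print(attribute_score(caption, ["red", "mug", "handle"], ["blue", "glass"]))
```

Because no reference caption is involved, a longer or differently worded caption is not penalised as long as it states the right attributes and avoids the wrong ones.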

Authors: Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui

Project page: https://describe-anything.github.io/
License: CC BY 4.0

Abstract: Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
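
The abstract describes the focal prompt only as a way to encode the target region at high resolution. One plausible reading, sketched below, is a crop around the region's mask that is enlarged to keep some surrounding context. This is an illustration under stated assumptions, not the authors' implementation: the helper names (`mask_bounding_box`, `expand_box`, `focal_crop`) and the `scale=3.0` context factor are hypothetical, and images are plain nested lists to keep the example dependency-free.

```python
import math
from typing import List, Sequence, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def mask_bounding_box(mask: Sequence[Sequence[int]]) -> Box:
    """Return the tight box around all non-zero mask pixels."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows or not cols:
        raise ValueError("mask is empty")
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1


def expand_box(box: Box, width: int, height: int, scale: float = 3.0) -> Box:
    """Grow a box around its centre by `scale` (assumed value), clamped to the image."""
    left, top, right, bottom = box
    cx, cy = (left + right) / 2, (top + bottom) / 2
    half_w = (right - left) * scale / 2
    half_h = (bottom - top) * scale / 2
    return (
        max(0, math.floor(cx - half_w)),
        max(0, math.floor(cy - half_h)),
        min(width, math.ceil(cx + half_w)),
        min(height, math.ceil(cy + half_h)),
    )


def focal_crop(image: List[list], mask: List[List[int]], scale: float = 3.0):
    """Crop the image and mask to the expanded box around the masked region."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = expand_box(
        mask_bounding_box(mask), width, height, scale
    )
    cropped_image = [row[left:right] for row in image[top:bottom]]
    cropped_mask = [row[left:right] for row in mask[top:bottom]]
    return cropped_image, cropped_mask
```

In a setup like this, the crop would be resized to the encoder's input resolution, so a small region occupies many more pixels than it does in the full image, while the full image still supplies the global context.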

Submitted to arXiv on 22 Apr. 2025

Results of the summarizing process for the arXiv paper: 2504.16072v1

Vision-language models have long struggled to generate accurate, detailed descriptions of specific regions in images and videos. To address this challenge, the Describe Anything Model (DAM) was introduced as a specialized model for Detailed Localized Captioning (DLC). By combining a focal prompt with a localized vision backbone, DAM balances local details with global context in image and video captions.

DLC-SDP, a Semi-supervised Learning-based Data Pipeline, improves the quality of DLC data by starting from existing segmentation datasets and expanding to unlabeled web images. DLC-Bench was introduced as a benchmark for evaluating DLC without relying on reference captions; its attribute-based evaluation approach overcomes limitations associated with reference-based scoring methods. DAM achieves state-of-the-art performance on 7 benchmarks that cover several levels of granularity in regional captioning, from keywords and phrases to detailed multi-sentence captions.

The paper also discusses the difficulty of generating detailed localized descriptions with off-the-shelf Vision-Language Models (VLMs). While VLMs like GPT-4o and LLaVA excel at producing global-level image descriptions, they struggle with detailed localized captions because of limitations in specifying the region of interest accurately. Several approaches to this issue were explored, including presenting only the region to the VLM through masking or cropping, and overlaying markings on the image as localization cues.

Overall, DAM, DLC-SDP, and DLC-Bench advance the ability of vision-language models to generate detailed localized image and video captions. The discussion sheds light on the challenges in this domain and points to avenues for further research.
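
The three ways of pointing an off-the-shelf VLM at a region mentioned above (cropping, masking, and overlaying a marking) can be sketched as simple image operations. This is a minimal illustration, not code from the paper: the function names are hypothetical, the image is a grayscale nested list, and the marking is assumed to be a one-pixel box outline.

```python
from typing import List, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]
Image = List[List[int]]


def crop_region(image: Image, box: Box) -> Image:
    """Show the model only the region: all surrounding context is lost."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def mask_outside(image: Image, mask: Image, fill: int = 0) -> Image:
    """Keep the image size but blank every pixel outside the region mask."""
    return [
        [pixel if keep else fill for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def overlay_box(image: Image, box: Box, color: int = 255) -> Image:
    """Draw a one-pixel outline on a copy of the image as a localization cue."""
    left, top, right, bottom = box
    out = [list(row) for row in image]  # copy so the input is not modified
    for x in range(left, right):
        out[top][x] = color
        out[bottom - 1][x] = color
    for y in range(top, bottom):
        out[y][left] = color
        out[y][right - 1] = color
    return out
```

The trade-off is visible even in this toy form: `crop_region` and `mask_outside` discard the context around the region, while `overlay_box` keeps the context but paints over pixels and leaves a small region at low resolution.
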
Created on 17 Nov. 2025
