Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

AI-generated keywords: Florence-2, vision foundation model, FLD-5B dataset, visual understanding, computer vision challenges

AI-generated Key Points

  • Introduces Florence-2, a vision foundation model with a unified, prompt-based representation
  • Takes a text prompt as the task instruction and returns results as text for tasks such as captioning, object detection, grounding, and segmentation
  • Development of FLD-5B dataset with 5.4 billion visual annotations on 126 million images
  • Dataset includes around 500 million text annotations and approximately 1.3 billion region-text annotations
  • Annotations cover multiple spatial hierarchies and semantic granularities
  • Florence-2's unified sequence-to-sequence architecture handles tasks at different levels of granularity, from high-level captions to more detailed descriptions
  • Evaluations show Florence-2 to be a strong vision foundation model contender in both zero-shot and fine-tuning settings

Authors: Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan

License: CC BY 4.0

Abstract: We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
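The abstract states that Florence-2 takes a text prompt as the task instruction and produces its results as text, even for tasks such as object detection. The sketch below illustrates one way a bounding box could be written as text: each coordinate is quantized into a bin and emitted as a location token. The bin count (`NUM_BINS`), the `<loc_N>` token format, the prompt string, and all function names are assumptions made for illustration and are not taken from the text above.

```python
from typing import List, Tuple

# Assumption: coordinates are quantized into this many bins (hypothetical value).
NUM_BINS = 1000

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def quantize(value: float, size: int, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value / size * num_bins)
    return min(max(bin_index, 0), num_bins - 1)


def box_to_text(label: str, box: Box, width: int, height: int) -> str:
    """Serialize one labeled box as the label followed by four location tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    # Hypothetical token format: <loc_N> for bin N.
    return label + "".join(f"<loc_{b}>" for b in bins)


def build_example(
    task_prompt: str,
    detections: List[Tuple[str, Box]],
    width: int,
    height: int,
) -> Tuple[str, str]:
    """Return a (prompt, target) text pair for sequence-to-sequence training."""
    target = "".join(box_to_text(label, box, width, height) for label, box in detections)
    return task_prompt, target


if __name__ == "__main__":
    prompt, target = build_example(
        "Locate the objects in the image.",  # hypothetical task prompt
        [("cat", (0, 0, 320, 240)), ("dog", (320, 240, 640, 480))],
        width=640,
        height=480,
    )
    print(prompt)
    print(target)
```

Under this scheme every task shares the same input and output type (text in, text out), which is what lets a single sequence-to-sequence model cover captioning and detection without task-specific heads.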

Submitted to arXiv on 10 Nov. 2023

Results of the summarizing process for the arXiv paper: 2311.06242v1

In this study, we introduce Florence-2, a vision foundation model designed for a wide range of computer vision and vision-language tasks. Existing large vision models excel at transfer learning but struggle to perform a diverse set of tasks from simple instructions. Florence-2 instead takes a text prompt as the task instruction and generates its result as text, whether the task is captioning, object detection, grounding, or segmentation.

To train Florence-2, we developed the FLD-5B dataset, which contains 5.4 billion visual annotations on 126 million images and was built with an iterative strategy of automated image annotation and model refinement. The dataset surpasses previous ones in both the quantity of annotations and their diversity across spatial hierarchy and semantic granularity. It includes around 500 million text annotations at varying levels of detail and approximately 1.3 billion region-text annotations, a substantially larger count than in existing object detection datasets. Each image in FLD-5B is annotated with text, region-text pairs, and text-phrase-region triplets, so that the annotations cover multiple spatial hierarchies and semantic granularities.

We also highlight the complexity of visual data: beyond image-level content, a model must capture details such as object locations and attributes. A universal representation therefore has to handle tasks organized along two dimensions, spatial hierarchy and semantic granularity. Florence-2's unified sequence-to-sequence architecture handles these different levels of granularity, moving from high-level captions to more detailed descriptions. Overall, extensive evaluations show Florence-2 to be a strong vision foundation model contender, with unprecedented zero-shot and fine-tuning capabilities across various tasks.
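To put the dataset figures quoted in the summary in perspective, the short calculation below derives average annotations per image. It is a back-of-the-envelope sketch that treats the rounded figures (5.4 billion, 126 million, around 500 million, approximately 1.3 billion) as exact. The remainder it computes equals the text-phrase-region count only under the assumption that the three annotation types partition the total, which the summary does not state explicitly.

```python
# Rounded figures quoted in the summary, treated as exact for illustration.
TOTAL_ANNOTATIONS = 5_400_000_000
IMAGES = 126_000_000
TEXT_ANNOTATIONS = 500_000_000
REGION_TEXT_ANNOTATIONS = 1_300_000_000


def per_image(count: int, images: int = IMAGES) -> float:
    """Average number of annotations per image."""
    return count / images


# Assumption: text, region-text, and text-phrase-region annotations
# together account for the full 5.4 billion total.
REMAINDER = TOTAL_ANNOTATIONS - TEXT_ANNOTATIONS - REGION_TEXT_ANNOTATIONS

if __name__ == "__main__":
    print(f"all annotations per image:   {per_image(TOTAL_ANNOTATIONS):.1f}")
    print(f"text annotations per image:  {per_image(TEXT_ANNOTATIONS):.1f}")
    print(f"region-text per image:       {per_image(REGION_TEXT_ANNOTATIONS):.1f}")
    print(f"remaining annotations:       {REMAINDER:,}")
    print(f"remaining per image:         {per_image(REMAINDER):.1f}")
```

On these rounded numbers, each image carries roughly 43 annotations on average, about 4 of them text annotations and about 10 region-text annotations.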
Created on 11 Jul. 2024
