Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

AI-generated keywords: Ferret-UI

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Ferret-UI is a novel multimodal large language model (MLLM) designed to enhance comprehension and interaction with user interface (UI) screens
  • Equipped with referring, grounding, and reasoning capabilities tailored for mobile UI screens
  • Incorporates "any resolution" to magnify details and leverage enhanced visual features
  • Training data includes elementary UI tasks such as icon recognition, finding text, and widget listing formatted with region annotations
  • Dataset for advanced tasks compiled to enhance reasoning ability
  • Outperforms most open-source UI MLLMs and surpasses GPT-4V on all elementary UI tasks

Authors: Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan

Abstract: Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
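The sub-image division described in the abstract can be illustrated with a minimal sketch. It assumes that "horizontal division" means a horizontal cut (top and bottom halves) and "vertical division" means a vertical cut (left and right halves). The function name `split_screen`, the even midpoint split, and the treatment of square screens as portrait are illustrative assumptions, not details from the paper.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels; this ordering is an assumption
# chosen for the sketch, not something specified by the abstract.
Box = Tuple[int, int, int, int]


def split_screen(width: int, height: int) -> List[Box]:
    """Return two crop boxes that together cover a width x height screen.

    Portrait screens are cut horizontally (top / bottom halves);
    landscape screens are cut vertically (left / right halves).
    Square screens are treated as portrait (an arbitrary choice).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if height >= width:
        mid = height // 2  # horizontal cut line
        return [(0, 0, width, mid), (0, mid, width, height)]
    mid = width // 2  # vertical cut line
    return [(0, 0, mid, height), (mid, 0, width, height)]


if __name__ == "__main__":
    print(split_screen(1170, 2532))  # portrait example
    print(split_screen(2532, 1170))  # landscape example
```

The boxes use the same (left, upper, right, lower) ordering that Pillow's `Image.crop` expects, so each box could be passed to it directly to obtain a sub-image for separate encoding.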

Submitted to arXiv on 08 Apr. 2024

Results of the summarizing process for the arXiv paper: 2404.05719v1

In their paper titled "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs," authors Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, and Zhe Gan introduce Ferret-UI, a multimodal large language model (MLLM) designed to improve comprehension of and interaction with user interface (UI) screens. General-domain MLLMs have progressed considerably, yet they often fall short when asked to understand and interact with UI screens. To address this limitation, Ferret-UI is equipped with referring, grounding, and reasoning capabilities tailored for mobile UI screens.

Because UI screens typically have a more elongated aspect ratio and contain smaller objects of interest (such as icons and text) than natural images, the researchers incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Each screen is divided into two sub-images based on its original aspect ratio: horizontal division for portrait screens and vertical division for landscape screens. The two sub-images are encoded separately before being sent to the LLM.

The training samples are gathered from a wide range of elementary UI tasks, such as icon recognition, finding text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To strengthen the model's reasoning ability, the researchers also compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference.

After training on these curated datasets, Ferret-UI demonstrates strong comprehension of UI screens and the capability to execute open-ended instructions. For evaluation, the researchers establish a comprehensive benchmark covering all of the tasks above, on which Ferret-UI outperforms most open-source UI MLLMs and surpasses GPT-4V on all elementary UI tasks. The approach represents an advance in MLLMs tailored specifically for interacting with user interface screens.
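To make "formatted for instruction-following with region annotations" more concrete, the sketch below shows one way such samples could be built. Everything about the format is hypothetical: the field names `instruction` and `response`, the bracketed `[left, top, right, bottom]` pixel notation, and the example questions are assumptions for illustration, not the paper's actual data schema. The only idea taken from the text is the distinction between referring (a region is given as input) and grounding (a region is produced as output).

```python
from typing import Dict, Tuple

# Hypothetical region type: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


def format_box(box: Box) -> str:
    """Render a region as bracketed text (an assumed notation)."""
    left, top, right, bottom = box
    return f"[{left}, {top}, {right}, {bottom}]"


def referring_sample(question: str, box: Box, answer: str) -> Dict[str, str]:
    """Referring: the region appears in the instruction, the answer is text."""
    return {
        "instruction": f"{question} {format_box(box)}",
        "response": answer,
    }


def grounding_sample(question: str, answer: str, box: Box) -> Dict[str, str]:
    """Grounding: the instruction is text, the region appears in the response."""
    return {
        "instruction": question,
        "response": f"{answer} {format_box(box)}",
    }


if __name__ == "__main__":
    # Icon recognition posed as a referring task (made-up example values).
    print(referring_sample("What is the icon in this region?",
                           (40, 80, 120, 160), "settings"))
    # Find text posed as a grounding task (made-up example values).
    print(grounding_sample('Where is the text "Sign in"?',
                           "Sign in", (300, 900, 560, 960)))
```

Writing regions as plain text keeps each sample a simple pair of strings, which is convenient for instruction tuning; a real pipeline may encode regions quite differently.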
Created on 09 Apr. 2024
