ScreenAI: A Vision-Language Model for UI and Infographics Understanding

AI-generated keywords: Digital Content Understanding

AI-generated Key Points

  • Infographics and user interfaces (UIs) are crucial for effective communication and human-machine interaction in the realm of digital content understanding.
  • Infographics distill complex information into visually appealing formats such as charts, diagrams, maps, and tables.
  • UIs on mobile and desktop platforms enable rich interactive experiences through design principles and visual language.
  • ScreenAI is a Vision-Language Model (VLM) developed to comprehend both UIs and infographics by combining the PaLI architecture with the Pix2struct patching mechanism.
  • Key contributions of ScreenAI include introducing textual representation for UIs during pretraining, generating training data at scale with Large Language Models (LLMs), covering a wide range of tasks in UI and infographic understanding, and releasing evaluation datasets for comprehensive benchmarking.
  • With 4.6 billion parameters as of January 17th, 2024, ScreenAI showcases state-of-the-art performance on public infographic QA benchmarks while being more efficient than larger models.
  • The model's refined architecture features an image encoder followed by a multimodal encoder that processes embedded text and image features before generating final text output through an autoregressive decoder.
  • ScreenAI's design choices and benchmark results position it as a strong option for digital content understanding tasks across UIs and infographics.

Authors: Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma

Accepted to International Joint Conference on Artificial Intelligence (IJCAI), 2024. Revision Notes: full version of the paper, including 1) Camera-ready version for IJCAI-24; 2) Appendices that are mentioned, but not included in 1)
License: CC BY 4.0

Abstract: Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
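
The annotation-to-LLM step described in the abstract can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `UIElement` fields, the element labels, the 0-999 coordinate grid, the one-line-per-element text format, and the prompt wording are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UIElement:
    kind: str                       # hypothetical label, e.g. "BUTTON" or "TEXT"
    text: str                       # visible text; may be empty
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) on an assumed 0-999 grid


def serialize_screen(elements: List[UIElement]) -> str:
    """Render one line per element, ordered top-to-bottom then left-to-right."""
    ordered = sorted(elements, key=lambda e: (e.box[1], e.box[0]))
    lines = []
    for e in ordered:
        x0, y0, x1, y1 = e.box
        label = f"{e.kind} {e.text}".strip()
        lines.append(f"{label} {x0} {y0} {x1} {y1}")
    return "\n".join(lines)


def build_qa_prompt(screen_text: str, num_questions: int = 3) -> str:
    """Wrap the textual screen description in an (illustrative) QA-generation prompt."""
    return (
        "Below is a textual description of a screen. Each line gives an "
        "element type, its text, and its bounding box.\n"
        f"{screen_text}\n"
        f"Write {num_questions} question-answer pairs that can be answered "
        "from this screen alone."
    )


elements = [
    UIElement("BUTTON", "Sign in", (400, 800, 600, 860)),
    UIElement("TEXT", "Welcome back", (100, 100, 900, 180)),
]
screen_text = serialize_screen(elements)
prompt = build_qa_prompt(screen_text)
print(prompt)
```

The point of the sketch is only that once a screen is expressed as text, a text-only LLM can be asked to produce QA, navigation, or summarization examples from it without seeing the pixels.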

Submitted to arXiv on 07 Feb. 2024


Results of the summarizing process for the arXiv paper: 2402.04615v3

In the realm of digital content understanding, infographics and user interfaces (UIs) serve as vital tools for effective communication and human-machine interaction. Infographics, encompassing charts, diagrams, maps, and tables, distill complex information into visually appealing formats. Similarly, UIs on mobile and desktop platforms facilitate rich interactive experiences through their design principles and visual language. Because infographics and UIs share visual elements, a unified model that can comprehend both domains is needed. This challenge led to the development of ScreenAI, a Vision-Language Model (VLM) tailored for comprehensive understanding of UIs and infographics. By leveraging the PaLI architecture with the patching mechanism of Pix2struct, ScreenAI tackles tasks such as question-answering (QA), element annotation, summarization, and navigation on these visual mediums.
The key contributions of ScreenAI are:
  1. Introducing a textual representation for UIs during pretraining to enhance model comprehension.
  2. Leveraging this representation with Large Language Models (LLMs) to generate training data at scale.
  3. Defining pretraining and fine-tuning mixtures covering a wide range of tasks in UI and infographic understanding.
  4. Releasing three evaluation datasets - Screen Annotation, ScreenQA Short, and Complex ScreenQA - enabling comprehensive benchmarking of models for screen-based QA.
These advancements position ScreenAI as a leading VLM for digital content understanding tasks across UIs and infographics. With just 4.6 billion parameters as of January 17th, 2024, the model shows state-of-the-art performance on public infographic QA benchmarks while outperforming larger models by significant margins. Its versatility makes it a strong choice for researchers and practitioners working on digital content analysis.
Furthermore, the refined architecture of ScreenAI features an image encoder followed by a multimodal encoder that processes embedded text and image features before generating final text output through an autoregressive decoder. The incorporation of pix2struct patching ensures adaptability to different aspect ratios and shapes within the visual data. Overall, ScreenAI's innovative design choices and superior performance underscore its potential as a go-to solution for diverse digital content understanding challenges in UIs, infographics, and beyond within the artificial intelligence landscape.
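
To make the aspect-ratio point concrete, here is a small Python sketch of how a Pix2struct-style patching scheme can choose a patch grid that follows the image shape while staying under a fixed patch budget. It is an illustration of the general idea, not ScreenAI's implementation; the function name `patch_grid` and the default `patch_size` and `max_patches` values are assumptions.

```python
import math


def patch_grid(height, width, patch_size=16, max_patches=1024):
    """Return (rows, cols) with rows * cols <= max_patches and rows/cols tracking height/width.

    The image would then be resized to (rows * patch_size, cols * patch_size)
    before being cut into patches, so tall and wide inputs keep their shape
    instead of being squashed into a fixed square.
    """
    # Largest uniform scale factor that keeps the total patch count within budget.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols


square = patch_grid(512, 512)       # (32, 32): the budget is used exactly
portrait = patch_grid(2000, 1000)   # more rows than columns for a tall screenshot
landscape = patch_grid(1000, 2000)  # more columns than rows for a wide one
print(square, portrait, landscape)
```

A fixed square grid would distort a tall phone screenshot and a wide desktop one in different ways; picking the grid from the input shape avoids that at the same sequence length.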
Created on 05 Dec. 2024

