Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

AI-generated keywords: Visually-Situated Language Understanding, Pix2Struct, Pretraining, Variable-Resolution Input Representation, Fine-Tuning Strategies

AI-generated Key Points

  • Introduction of general-purpose visually-situated language understanding
  • Proposal of Pix2Struct, a pretrained image-to-text model for visually-situated language tasks
  • Subsuming pretraining signals such as OCR, language modeling, and image captioning
  • Pretraining through parsing masked screenshots of web pages into simplified HTML
  • Variable-resolution input representation and flexible integration of language and vision inputs
  • State-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images
  • Major contributions: introducing general-purpose visually-situated language understanding; a screenshot parsing pretraining objective; variable-resolution input representations; new fine-tuning strategies for integrating language and vision inputs
  • Encouragement for further development at the intersection of language and vision

Authors: Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova

Accepted at ICML
License: CC BY 4.0

Abstract: Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to this diversity, previous work has typically relied on domain-specific recipes with limited sharing of the underlying data, model architectures, and objectives. We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding, which can be finetuned on tasks containing visually-situated language. Pix2Struct is pretrained by learning to parse masked screenshots of web pages into simplified HTML. The web, with its richness of visual elements cleanly reflected in the HTML structure, provides a large source of pretraining data well suited to the diversity of downstream tasks. Intuitively, this objective subsumes common pretraining signals such as OCR, language modeling, image captioning. In addition to the novel pretraining strategy, we introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, where language prompts such as questions are rendered directly on top of the input image. For the first time, we show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images.
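
The abstract mentions a variable-resolution input representation without giving details. The following is a minimal sketch of one way such a representation could work, assuming the idea is to rescale an image with its aspect ratio roughly preserved so that as many fixed-size patches as possible fit within a patch budget. The function name `patch_grid`, the patch size of 16, and the budget of 2048 patches are illustrative assumptions, not values stated on this page.

```python
import math


def patch_grid(image_height, image_width, patch_size=16, max_patches=2048):
    """Return (rows, cols) of a patch grid that fits within max_patches.

    Hypothetical helper: the grid follows the image's aspect ratio, so a wide
    screenshot gets more columns than rows instead of being squashed square.
    """
    if image_height <= 0 or image_width <= 0:
        raise ValueError("image dimensions must be positive")
    # Scale factor such that rows * cols is as close to max_patches as possible.
    scale = math.sqrt(
        max_patches * (patch_size / image_height) * (patch_size / image_width)
    )
    rows = max(min(math.floor(scale * image_height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_size), max_patches), 1)
    return rows, cols


# A square image and a wide screenshot get differently shaped grids.
square_rows, square_cols = patch_grid(1000, 1000)
wide_rows, wide_cols = patch_grid(400, 1600)
```

Under these assumptions, the image would then be resized to `rows * patch_size` by `cols * patch_size` pixels before patches are extracted, so the sequence length stays bounded while the aspect ratio is kept approximately intact.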

Submitted to arXiv on 07 Oct. 2022


Results of the summarizing process for the arXiv paper: 2210.03347v2

The paper introduces general-purpose visually-situated language understanding, a setting whose tasks differ in domain but share common challenges. The authors propose Pix2Struct, a pretrained image-to-text model that can be fine-tuned on tasks involving visually-situated language. The model is pretrained by learning to parse masked screenshots of web pages into simplified HTML, exploiting the fact that a page's visual elements are reflected in its HTML structure. This objective subsumes common pretraining signals such as OCR, language modeling, and image captioning. The authors also introduce a variable-resolution input representation and a more flexible integration of language and vision inputs, in which text prompts such as questions are rendered directly on top of the input image. Pix2Struct achieves state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images. The major contributions are: introducing general-purpose visually-situated language understanding; proposing a screenshot parsing pretraining objective based on the HTML source of web pages; and introducing variable-resolution input representations and new fine-tuning strategies that integrate language and vision inputs. The authors hope these results will encourage further development of general-purpose methods at the intersection of language and vision.
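
The screenshot parsing objective described above pairs a rendered page with a simplified HTML string as the prediction target. The summary does not specify the simplification rules, so the sketch below is a toy illustration only: it assumes two rules (drop subtrees with no visible text, and collapse wrappers that contain a single element), and the names `simplify`, `to_text`, and the tuple-based page representation are hypothetical.

```python
def simplify(node):
    """Return a simplified copy of a toy DOM node, or None if nothing visible remains.

    A node is either a string (visible text) or a (tag, children) tuple.
    """
    if isinstance(node, str):
        text = node.strip()
        return text or None
    tag, children = node
    kept = [c for c in (simplify(child) for child in children) if c is not None]
    if not kept:
        return None  # assumption: subtrees without visible content are dropped
    if len(kept) == 1 and not isinstance(kept[0], str):
        return kept[0]  # assumption: a wrapper around one element is collapsed
    return (tag, kept)


def to_text(node):
    """Serialize a simplified node into a flat target string."""
    if isinstance(node, str):
        return node
    tag, children = node
    inner = " ".join(to_text(child) for child in children)
    return f"<{tag}> {inner} </{tag}>"


# Toy page: nested wrapper divs, an empty div, and a short list.
page = ("body", [
    ("div", [("div", [("h1", ["Pricing"])])]),
    ("div", ["   "]),
    ("ul", [("li", ["Basic"]), ("li", ["Pro"])]),
])
target = to_text(simplify(page))
```

In a pretraining setup along these lines, the model would see only the screenshot (with some regions masked) and be trained to generate a string like `target`, which is why the objective covers reading text, predicting hidden text, and describing layout at once.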
Created on 05 Oct. 2023

