MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding

AI-generated keywords: Visually-rich Document Understanding

AI-generated Key Points

  • Significant progress in Visually-Rich Document Understanding (VrDU) through multimodal pre-training techniques
  • Introduction of the MarkupLM model for document understanding tasks involving markup languages like HTML/XML-based documents
  • Outperformance of existing baseline models by MarkupLM on various VrDU tasks
  • Importance of leveraging markup structures for document-level pre-training in VrDU tasks, especially for markup-language-based documents
  • Potential expansion of MarkupLM to digital-born PDFs and Office documents, exploring synergies with LayoutLM under multi-view settings

Authors: Junlong Li, Yiheng Xu, Lei Cui, Furu Wei

Work in Progress
License: CC BY-NC-SA 4.0

Abstract: Multimodal pre-training with text, layout, and image has made significant progress for Visually-rich Document Understanding (VrDU), especially for fixed-layout documents such as scanned document images. However, there are still a large number of digital documents where the layout information is not fixed and needs to be interactively and dynamically rendered for visualization, making existing layout-based pre-training approaches not easy to apply. In this paper, we propose MarkupLM for document understanding tasks with markup languages as the backbone, such as HTML/XML-based documents, where text and markup information is jointly pre-trained. Experiment results show that the pre-trained MarkupLM significantly outperforms the existing strong baseline models on several document understanding tasks. The pre-trained model and code will be publicly available at https://aka.ms/markuplm.

Submitted to arXiv on 16 Oct. 2021

Results of the summarizing process for the arXiv paper: 2110.08518v1

Significant Progress in Visually-Rich Document Understanding: The Role of Multimodal Pre-Training and the MarkupLM Model

In recent years, Visually-Rich Document Understanding (VrDU) has advanced considerably, largely through multimodal pre-training that combines text, layout, and image information. These approaches work well for fixed-layout documents such as scanned images. Many digital documents, however, have no fixed layout: the layout must be rendered dynamically and interactively for visualization, which makes existing layout-based pre-training approaches hard to apply.

To address this gap, the MarkupLM model has been proposed for document understanding tasks involving markup languages, such as HTML/XML-based documents. By jointly pre-training on text and markup information, MarkupLM outperforms existing strong baseline models on several document understanding tasks.

The motivation is that visually-rich documents fall into two categories: fixed-layout and markup-language-based. Fixed-layout documents already carry pre-rendered layout and style information, which makes them suitable for existing layout-based pre-training methods. In markup-language-based documents, layout and style are rendered dynamically and depend on the device used, so the markup structure itself is the stable signal to exploit for document-level pre-training. MarkupLM does this by integrating text and markup language pre-training in a single Transformer-based framework. New pre-training objectives tailored to markup language improve model performance on datasets such as WebSRC and SWDE.

Fine-tuning MarkupLM for downstream tasks such as reading comprehension and information extraction follows common practice for pre-trained language models: the last hidden states of the tokens are passed to binary or linear classification layers.

In future research, there is potential to extend MarkupLM to digital-born PDFs and Office documents and to explore synergies between LayoutLM and MarkupLM under multi-view settings. Overall, MarkupLM is a promising step toward better document understanding through the joint pre-training of text and markup language.
Created on 27 Mar. 2024
