General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

AI-generated keywords: General OCR Theory, Optical Character Recognition, GOT model, data collection, diverse tasks

AI-generated Key Points

  • Authors propose the General OCR Theory (GOT) and model for advancing OCR to OCR-2.0
  • GOT model designed to handle various types of artificial optical signals, including plain texts, formulas, tables, charts, sheet music, and shapes
  • GOT model has 580M parameters and pairs a high-compression encoder with a long-contexts decoder
  • Authors built training and test data from diverse sources using six rendering tools, including LaTeX and Mathpix-markdown-it
  • Model's capabilities extended to harder tasks such as sheet music recognition and chart analysis, drawing on datasets like GrandStaff
  • Authors report that this combined approach to data collection and model design lets GOT handle a wide range of OCR tasks and outperform traditional OCR systems

Authors: Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

License: CC BY-SA 4.0

Abstract: Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-contexts decoder. As an OCR-2.0 model, GOT can handle all the above "characters" under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via an easy prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.
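
The abstract describes a prompt-driven interface: the caller chooses plain or formatted output (markdown/tikz/smiles/kern) and may restrict recognition to a region given by coordinates or by a color. The sketch below only illustrates how such options could be combined into a single prompt string. Every name in it (`OcrRequest`, `build_prompt`, the prompt wording, and the set of allowed colors) is a hypothetical assumption for illustration and is not taken from the paper or from the GOT code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Output formats named in the abstract; "plain" stands for unformatted text.
FORMATS = {"plain", "markdown", "tikz", "smiles", "kern"}
# Assumed color set for color-guided regions (not specified in the abstract).
COLORS = {"red", "green", "blue"}


@dataclass(frozen=True)
class OcrRequest:
    """Hypothetical container for the options the abstract mentions."""
    output_format: str = "plain"
    box: Optional[Tuple[int, int, int, int]] = None  # (x1, y1, x2, y2) in pixels
    color: Optional[str] = None


def build_prompt(req: OcrRequest) -> str:
    """Turn a request into a prompt string (the wording is an assumption)."""
    if req.output_format not in FORMATS:
        raise ValueError(f"unknown output format: {req.output_format!r}")
    if req.box is not None and req.color is not None:
        raise ValueError("give either a box or a color, not both")

    parts = []
    if req.box is not None:
        x1, y1, x2, y2 = req.box
        if x1 >= x2 or y1 >= y2:
            raise ValueError("box must satisfy x1 < x2 and y1 < y2")
        parts.append(f"[{x1},{y1},{x2},{y2}]")
    elif req.color is not None:
        if req.color not in COLORS:
            raise ValueError(f"unknown color: {req.color!r}")
        parts.append(f"[{req.color}]")

    if req.output_format == "plain":
        parts.append("OCR:")
    else:
        parts.append(f"OCR with format ({req.output_format}):")
    return " ".join(parts)
```

For example, `build_prompt(OcrRequest("markdown", box=(10, 20, 300, 400)))` produces one string that carries both the region and the requested format, which is the kind of "easy prompt" control the abstract refers to.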

Submitted to arXiv on 03 Sep. 2024

Results of the summarizing process for the arXiv paper: 2409.01704v1

The authors propose the General OCR Theory (GOT) and an associated model to advance Optical Character Recognition (OCR) from traditional systems (OCR-1.0) to OCR-2.0. The GOT model is designed to handle many types of artificial optical signals, or "characters," including plain text, math and molecular formulas, tables, charts, sheet music, and geometric shapes. It has 580M parameters and combines a high-compression encoder with a long-contexts decoder for end-to-end processing. For training and testing, the authors gathered diverse data sources and rendered them with six tools, such as LaTeX for tables and Mathpix-markdown-it for math and molecular formulas. They also extend the model to more challenging tasks such as sheet music recognition and chart analysis, drawing on datasets like GrandStaff. The authors report that GOT handles this wide range of OCR tasks and outperforms traditional OCR systems in their experiments.
Created on 11 Sep. 2024
