Enhancing OCR Performance through Post-OCR Models: Adopting Glyph Embedding for Improved Correction

AI-generated keywords: OCR models

AI-generated Key Points

  • Study focuses on potential of post-OCR models to overcome limitations in OCR models
  • Explores impact of incorporating glyph embedding on post-OCR correction performance
  • Researchers developed their own post-OCR correction model, which embeds the OCR output with CharBERT and a unique technique for capturing the visual characteristics of characters
  • Two datasets used: ICDAR 2013 dataset and ICDAR 2023 dataset
  • ICDAR 2013 dataset serves as benchmark for evaluating OCR systems, includes diverse document images with printed and handwritten text captured under various conditions
  • ICDAR 2023 dataset introduces additional challenges such as intricate layouts, degraded text, low-resolution images, and background clutter
  • Three OCR models evaluated: EasyOCR, PaddleOCR, TrOCR
  • EasyOCR used for its single-word detection boxes; PaddleOCR and TrOCR provide single-line detection boxes
  • Training of glyph embedding relies on Chars74K dataset which includes scene images of English and Kannada characters (only English alphabets utilized)
  • Screenshots of Korean and Hebrew characters captured for garbage class in open-set classification
  • State-of-the-art (SOTA) models as well as researchers' post-OCR correction model examined in methodology

Authors: Yung-Hsin Chen, Yuli Zhou

License: CC BY 4.0

Abstract: The study investigates the potential of post-OCR models to overcome limitations in OCR models and explores the impact of incorporating glyph embedding on post-OCR correction performance. In this study, we have developed our own post-OCR correction model. The novelty of our approach lies in embedding the OCR output using CharBERT and our unique embedding technique, capturing the visual characteristics of characters. Our findings show that post-OCR correction effectively addresses deficiencies in inferior OCR models, and glyph embedding enables the model to achieve superior results, including the ability to correct individual words.
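
One way to quantify how much a post-OCR correction step helps is to compare an error rate before and after correction. The sketch below computes a character error rate (CER) from the Levenshtein edit distance. This is an assumption for illustration only: the page does not state which metric the authors report, and the function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(hypothesis, reference) / len(reference)
```

Calling `cer` once on the raw OCR output and once on the corrected output, both against the same ground-truth string, gives a simple before/after comparison.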

Submitted to arXiv on 29 Aug. 2023

Results of the summarizing process for the arXiv paper: 2308.15262v1

The study focuses on the potential of post-OCR models to overcome limitations in OCR models and explores the impact of incorporating glyph embedding on post-OCR correction performance. The researchers developed their own post-OCR correction model, which embeds the OCR output using CharBERT together with a unique technique for capturing the visual characteristics of characters.

The study uses two datasets: the ICDAR 2013 dataset and the more recent ICDAR 2023 dataset. The ICDAR 2013 dataset consists of diverse document images with printed and handwritten text captured under various conditions, and serves as a benchmark for evaluating OCR systems. The ICDAR 2023 dataset expands upon its predecessor with additional challenges such as intricate document layouts, degraded text, low-resolution images, and background clutter.

Three OCR models are evaluated on these datasets: EasyOCR, PaddleOCR, and TrOCR. EasyOCR is used specifically for its single-word detection boxes, while PaddleOCR and TrOCR provide single-line detection boxes. The outputs from EasyOCR therefore consist solely of single words, while PaddleOCR generates output in sentence form.

To train the glyph embedding in their post-OCR correction model, the researchers rely on the Chars74K dataset. This dataset includes scene images of English and Kannada characters, although only the English alphabet is used in this experiment. Additionally, screenshots of Korean and Hebrew characters are captured to serve as samples for the garbage class in open-set classification. In terms of methodology, state-of-the-art (SOTA) models as well as the researchers' post-OCR correction model are examined.
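
The garbage class used for open-set classification can be illustrated with a small labeling sketch: every known English letter gets its own class id, and any other character (for example a Korean or Hebrew glyph) is mapped to one extra "garbage" id. This is a minimal sketch, not the authors' implementation. The names `glyph_label` and `GARBAGE_CLASS`, and the choice to keep upper and lower case as separate classes, are assumptions.

```python
import string

# Assumption: one class per English letter, with upper and lower case
# kept as separate classes (52 known classes in total).
KNOWN_CHARS = string.ascii_letters
CLASS_INDEX = {ch: i for i, ch in enumerate(KNOWN_CHARS)}

# One extra class id that absorbs every character outside the known set.
GARBAGE_CLASS = len(KNOWN_CHARS)


def glyph_label(ch: str) -> int:
    """Return the class id of a single character, or the garbage class id."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return CLASS_INDEX.get(ch, GARBAGE_CLASS)


def label_samples(chars):
    """Label a sequence of single characters for classifier training."""
    return [glyph_label(ch) for ch in chars]
```

With this scheme, a classifier trained on the labeled glyph images has an explicit output for "none of the known letters" instead of being forced to pick an English letter for every input.
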
Created on 30 Oct. 2023

