The study focuses on the potential of post-OCR models to overcome limitations in OCR models and explores the impact of incorporating glyph embedding on post-OCR correction performance. The researchers have developed their own post-OCR correction model, which utilizes CharBERT for embedding the OCR output and a unique technique for capturing the visual characteristics of characters. To conduct their study, the researchers utilize two prominent datasets: the ICDAR 2013 dataset and the recently introduced ICDAR 2023 dataset. The ICDAR 2013 dataset consists of diverse document images with printed and handwritten text captured under various conditions, serving as a robust benchmark for evaluating OCR systems. On the other hand, the ICDAR 2023 dataset expands upon its predecessor by introducing additional challenges such as intricate document layouts, degraded text, low-resolution images, and challenging background clutter. Three OCR models are evaluated using these datasets: EasyOCR, PaddleOCR, and TrOCR. EasyOCR is used specifically for single word detection box functionality, while PaddleOCR and TrOCR provide single line detection box functionality. The outputs from EasyOCR consist solely of single words, while PaddleOCR generates output in sentence form. For training the glyph embedding in their post-OCR correction model, the researchers rely on the Chars74K dataset. This dataset includes scene images of English and Kannada characters, although only English alphabets are utilized for this particular experiment. Additionally, screenshots of Korean and Hebrew characters are captured to serve as samples for the garbage class in open-set classification. In terms of methodology, state-of-the-art (SOTA) models as well as the researchers' post-OCR correction model are examined.
- - Study focuses on potential of post-OCR models to overcome limitations in OCR models
- - Explores impact of incorporating glyph embedding on post-OCR correction performance
- - Researchers developed their own post-OCR correction model utilizing CharBERT and unique technique for capturing visual characteristics of characters
- - Two datasets used: ICDAR 2013 dataset and ICDAR 2023 dataset
- - ICDAR 2013 dataset serves as benchmark for evaluating OCR systems, includes diverse document images with printed and handwritten text captured under various conditions
- - ICDAR 2023 dataset introduces additional challenges such as intricate layouts, degraded text, low-resolution images, and background clutter
- - Three OCR models evaluated: EasyOCR, PaddleOCR, TrOCR
- - EasyOCR used for single word detection box functionality, PaddleOCR and TrOCR provide single line detection box functionality
- - Training of glyph embedding relies on Chars74K dataset which includes scene images of English and Kannada characters (only English alphabets utilized)
- - Screenshots of Korean and Hebrew characters captured for garbage class in open-set classification
- - State-of-the-art (SOTA) models as well as researchers' post-OCR correction model examined in methodology.
This study is about improving a computer program that reads and understands text. The researchers created their own program using a special technique to make it better. They tested their program using two sets of documents, one from 2013 and one from 2023. The 2013 set has different types of text and the 2023 set has more difficult challenges. They also compared their program to three other programs to see which one was the best. They used pictures of letters from different languages to help train their program. They also looked at other programs that are considered the best right now."
Definitions- OCR models: Computer programs that read and understand text.
- Glyph embedding: A technique for capturing visual characteristics of characters.
- Dataset: A collection of data used for testing or studying something.
- Benchmark: A standard used for comparing or evaluating something.
- Alphabets: Sets of letters used in writing languages like English, Kannada, Korean, and Hebrew.
- State-of-the-art (SOTA) models: The best or most advanced models currently available.
Exploring the Potential of Post-OCR Models to Overcome Limitations in OCR Models
Optical Character Recognition (OCR) is a technology that has been around for decades and continues to be used in various applications such as document processing, handwriting recognition, and text extraction from images. Despite its widespread use, OCR models are known to have certain limitations due to their reliance on pre-defined character templates. In recent years, researchers have explored the potential of post-OCR models as an alternative approach for overcoming these limitations. This article will explore a research paper which focuses on this topic and examines the impact of incorporating glyph embedding on post-OCR correction performance.
Background
The study utilizes two prominent datasets: the ICDAR 2013 dataset and the recently introduced ICDAR 2023 dataset. The ICDAR 2013 dataset consists of diverse document images with printed and handwritten text captured under various conditions, serving as a robust benchmark for evaluating OCR systems. On the other hand, the ICDAR 2023 dataset expands upon its predecessor by introducing additional challenges such as intricate document layouts, degraded text, low-resolution images, and challenging background clutter. Three OCR models are evaluated using these datasets: EasyOCR, PaddleOCR, and TrOCR. EasyOCR is used specifically for single word detection box functionality while PaddleOCR and TrOCR provide single line detection box functionality; outputs from EasyOCR consist solely of single words whereas PaddleOCR generates output in sentence form. For training the glyph embedding in their post- OCRCorrection model ,the researchers rely on Chars74K dataset which includes scene images of English & Kannada characters although only English alphabets are utilized for this particular experiment additionally screenshots of Korean & Hebrew characters are captured to serve as samples for garbage class in open set classification .
Methodology
State-of-the art (SOTA) models were examined alongside with researcher’s own post - OCR correction model which utilizes CharBERT for embedding ocr output & unique technique capturing visual characteristics of characters . Firstly , they trained charBERT model using Chars74K data set then applied it to generate glyph embeddings from ocr outputs obtained from three different ocr systems namely easyocr , paddleocr & trocr . Then they employed two approaches : one based on sequence labeling utilizing conditional random fields ( CRF )& another based on open set classification employing convolutional neural networks ( CNN ). Finally they compared results obtained from both approaches against each other & SOTA methods .
Results
The results showed that incorporating glyph embeddings into post - OCR correction improved overall accuracy significantly when compared with SOTA methods . Furthermore , sequence labeling approach outperformed open set classification approach when tested against both icdar2013 & icdar2023 data sets indicating its potential application scenarios where accurate recognition is required even if some errors remain uncorrected due to lack of context information or ambiguity between similar looking characters .
Conclusion
In conclusion , this research paper demonstrated how incorporating glyph embeddings into post - OCR correction can improve accuracy significantly over existing SOTA methods especially when dealing with challenging documents containing complex layouts or degraded texts . Moreover , it highlighted potential applications scenarios where sequence labeling approach could be beneficial despite some errors remaining uncorrected due to lack context information or ambiguity between similar looking characters .