Patchfinder: Leveraging Visual Language Models for Accurate Information Retrieval using Model Uncertainty

AI-generated keywords: Information Extraction, Scanned Documents, Vision Language Models, PatchFinder Algorithm, Performance Improvement

AI-generated Key Points

  • Traditional OCR methods struggle with noisy backgrounds, varied fonts, and handwritten content
  • Additional processing steps using large language models (LLMs) are needed to structure extracted information
  • Two-step approach can introduce errors and inefficiencies due to context loss and limited flexibility
  • Vision Language Models (VLMs) offer a promising solution for efficient and accurate information extraction from scanned documents
  • PatchFinder algorithm is introduced for extracting information from scanned documents by leveraging VLMs
  • PatchFinder uses a Patch Confidence score, derived from the Maximum Softmax Probability of the VLM's output, to choose a suitable patch size and partition the input document into overlapping patches
  • PatchFinder reports sizable accuracy gains over existing methods by processing visual and textual information within a single model
  • Using Phi-3v, a 4.2 billion parameter VLM, PatchFinder reaches 94% accuracy on a dataset of 190 noisy scanned documents, surpassing ChatGPT-4o by 18.5 percentage points
  • Preliminary results on datasets from Colorado and Pennsylvania showcase PatchFinder's superior performance in handling complex layouts and noisy backgrounds
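
The headline gap is stated in percentage points, not as a relative improvement, and the two are easy to confuse. The short Python sketch below works through the arithmetic. The baseline accuracy is only implied by the two reported figures (94% and 18.5 points); it is not stated directly in this summary, and the variable names are illustrative.

```python
def baseline_from_gap(accuracy_pct: float, gap_pp: float) -> float:
    """Accuracy of the comparison model implied by an absolute gap in percentage points."""
    return accuracy_pct - gap_pp


patchfinder_pct = 94.0  # reported PatchFinder accuracy
gap_pp = 18.5           # reported gap over ChatGPT-4o, in percentage points

# Implied comparison accuracy (an inference from the two reported numbers).
baseline_pct = baseline_from_gap(patchfinder_pct, gap_pp)

# The same gap expressed as a relative improvement over the implied baseline.
relative_gain_pct = gap_pp / baseline_pct * 100

print(f"implied baseline: {baseline_pct:.1f}%")
print(f"relative gain:    {relative_gain_pct:.1f}%")
```

An 18.5-point gap over an implied 75.5% baseline corresponds to a relative gain of roughly 24.5%, which is why the unit matters when comparing results.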

Authors: Roman Colman, Minh Vu, Manish Bhattarai, Martin Ma, Hari Viswanathan, Daniel O'Malley, Javier E. Santos

This paper has been accepted to IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025
License: CC BY 4.0

Abstract: For decades, corporations and governments have relied on scanned documents to record vast amounts of information. However, extracting this information is a slow and tedious process due to the overwhelming amount of documents. The rise of vision language models presents a way to efficiently and accurately extract the information out of these documents. The current automated workflow often requires a two-step approach involving the extraction of information using optical character recognition software, and subsequent usage of large language models for processing this information. Unfortunately, these methods encounter significant challenges when dealing with noisy scanned documents. The high information density of such documents often necessitates using computationally expensive language models to effectively reduce noise. In this study, we propose PatchFinder, an algorithm that builds upon Vision Language Models (VLMs) to address the information extraction task. First, we devise a confidence-based score, called Patch Confidence, based on the Maximum Softmax Probability of the VLMs' output to measure the model's confidence in its predictions. Then, PatchFinder utilizes that score to determine a suitable patch size, partition the input document into overlapping patches of that size, and generate confidence-based predictions for the target information. Our experimental results show that PatchFinder can leverage Phi-3v, a 4.2 billion parameter vision language model, to achieve an accuracy of 94% on our dataset of 190 noisy scanned documents, surpassing the performance of ChatGPT-4o by 18.5 percentage points.
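
The abstract does not spell out how Patch Confidence is computed beyond its basis in the Maximum Softmax Probability (MSP). As a minimal sketch, assuming the VLM exposes per-token logits and assuming the per-token MSP values are aggregated by a mean or a minimum, the idea could look like this in Python. The `sequence_confidence` helper and its `reduce` options are hypothetical and are not the authors' definition.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax_probability(logits: Sequence[float]) -> float:
    """MSP: the probability assigned to the most likely token."""
    return max(softmax(logits))


def sequence_confidence(token_logits: Sequence[Sequence[float]],
                        reduce: str = "mean") -> float:
    """Aggregate per-token MSP into one score (the aggregation is an assumption)."""
    msps = [max_softmax_probability(step) for step in token_logits]
    if not msps:
        raise ValueError("token_logits must contain at least one decoding step")
    if reduce == "mean":
        return sum(msps) / len(msps)
    if reduce == "min":
        return min(msps)
    raise ValueError(f"unknown reduce option: {reduce}")


# A peaked distribution yields a score near 1; a flat one yields 1 / vocabulary size.
confident = sequence_confidence([[10.0, 0.0, 0.0]])
uncertain = sequence_confidence([[1.0, 1.0, 1.0]])
```

A score near 1 means the model put almost all probability mass on its chosen tokens, while a score near the reciprocal of the vocabulary size means it was close to guessing.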

Submitted to arXiv on 03 Dec. 2024

Results of the summarizing process for the arXiv paper: 2412.02886v1

Information extraction from scanned documents has long relied on traditional OCR methods to convert images of text into machine-readable formats. These methods often struggle with noisy backgrounds, varied fonts, and handwritten content, so the extracted text typically needs a further pass through large language models (LLMs) to be structured. This two-step approach can be effective, but it can also introduce errors and inefficiencies through context loss and limited flexibility in handling diverse layouts and noise levels.

Vision Language Models (VLMs) offer a promising alternative. Because they combine visual and language components internally, they enable a more streamlined end-to-end workflow than OCR followed by an LLM. Building on VLMs, this study introduces PatchFinder, an algorithm for extracting information from scanned documents. PatchFinder first assesses the VLM's confidence in its predictions using a newly proposed score called Patch Confidence, which is based on the Maximum Softmax Probability of the VLM's output. It uses that score to determine a suitable patch size, partitions the input document into overlapping patches of that size, and generates confidence-based predictions of the target information from the patches.

In the reported experiments, PatchFinder with Phi-3v, a 4.2 billion parameter vision language model, achieves 94% accuracy on a dataset of 190 noisy scanned documents, surpassing ChatGPT-4o by 18.5 percentage points. The study also establishes a baseline using state-of-the-art open-source models on datasets from Colorado and Pennsylvania, where preliminary results show PatchFinder outperforming the other methods on documents with complex layouts and noisy backgrounds. Overall, the research highlights the potential of VLMs for information extraction from complex and noisy scanned documents.
Created on 18 Aug. 2025
