Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval

AI-generated keywords: Automation

AI-generated Key Points

  • Increasing automation in document understanding procedures across industries
  • Digitization of historical documents in archives and libraries
  • Introduction of Fetch-A-Set (FAS) benchmark for legislative historical document analysis systems
  • Focus on text-to-image retrieval and image-to-text topic extraction
  • Addressing challenges of large-scale document retrieval in historical contexts
  • Providing baselines and data for development and evaluation of robust historical document retrieval systems

Authors: Adrià Molina, Oriol Ramos Terrades, Josep Lladós

Preprint for the manuscript accepted for publication in the DAS2024 LNCS proceedings
License: CC BY-NC-SA 4.0

Abstract: This paper introduces Fetch-A-Set (FAS), a comprehensive benchmark tailored for legislative historical document analysis systems, addressing the challenges of large-scale document retrieval in historical contexts. The benchmark comprises a vast repository of documents dating back to the XVII century, serving both as a training resource and an evaluation benchmark for retrieval systems. It fills a critical gap in the literature by focusing on complex extractive tasks within the domain of cultural heritage. The proposed benchmark tackles the multifaceted problem of historical document analysis, including text-to-image retrieval for queries and image-to-text topic extraction from document fragments, all while accommodating varying levels of document legibility. This benchmark aims to spur advancements in the field by providing baselines and data for the development and evaluation of robust historical document retrieval systems, particularly in scenarios characterized by a wide historical spectrum.

Submitted to arXiv on 11 Jun. 2024


Results of the summarizing process for the arXiv paper: 2406.07315v1

As document understanding procedures become increasingly automated, there is a growing need for systems that can extract information, index, summarize, and assist in decision-making tasks across various industries. This trend extends to the digitization of historical documents in archives and libraries, where the automation of document management processes is becoming more prevalent. With historical data gaining importance in governmental bodies, heritage management is also shifting towards automation.

The paper introduces Fetch-A-Set (FAS), a comprehensive benchmark tailored specifically for legislative historical document analysis systems. FAS addresses the challenges of large-scale document retrieval in historical contexts by providing a vast repository of documents dating back to the XVII century. This repository serves as both a training resource and an evaluation benchmark for retrieval systems, filling a critical gap in the literature by focusing on complex extractive tasks within the domain of cultural heritage. A key aspect of FAS is its focus on text-to-image retrieval for queries and image-to-text topic extraction from document fragments, while accommodating varying levels of document legibility. The benchmark aims to spur advancements in the field by providing baselines and data for the development and evaluation of robust historical document retrieval systems, particularly in scenarios characterized by a wide historical spectrum.

The paper examines two main challenges faced by heritage institutions when handling vast sources of historical documents. First, it addresses the need for continuous indexing of databases to enable natural language queries through a "text-to-image" task known as topic spotting. Second, it explores the "image-to-text" task of information extraction, which provides feasible sets of texts from images and thereby automatically categorizes historical data in archival procedures.

By incorporating complex understanding tasks into historical document analysis systems, novel services are expected to emerge that will enhance our understanding of history. The proposed benchmark aims to evaluate how effectively document understanding systems fetch relevant information directly from natural text without relying on expensive OCR solutions, especially for large historical collections with significant temporal variance. Overall, the work targets the key challenges of topic-aware document retrieval and provides a benchmark against which system performance in this domain can be measured and improved.
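
To make the "text-to-image" retrieval task more concrete, the following is a minimal sketch of how such a benchmark could score a system: each text query and each document image is assumed to be already mapped to a vector in a shared embedding space, documents are ranked by cosine similarity, and recall@k counts how often the relevant document appears among the top k. The summary does not state which models or metrics the paper uses, so the embedding vectors, the toy data, and the function names `rank_documents` and `recall_at_k` are all illustrative assumptions, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    # Sort by descending similarity; break ties by document index for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored]


def recall_at_k(query_vecs, doc_vecs, relevant, k):
    """Fraction of queries whose single relevant document is ranked in the top k."""
    hits = 0
    for q, rel in zip(query_vecs, relevant):
        if rel in rank_documents(q, doc_vecs)[:k]:
            hits += 1
    return hits / len(query_vecs)


# Hypothetical toy embeddings: three document images and three text queries.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vecs = [[0.9, 0.1], [0.1, 0.9], [0.2, 1.0]]
relevant = [0, 1, 2]  # index of the document each query should retrieve

print(recall_at_k(query_vecs, doc_vecs, relevant, k=1))
print(recall_at_k(query_vecs, doc_vecs, relevant, k=2))
```

In an OCR-free setting, the document vectors would come from an image encoder applied to the scanned page or fragment rather than from transcribed text; the ranking and scoring steps stay the same.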
Created on 02 Oct. 2024
