SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition

AI-generated keywords: End-to-end scene text spotting, SwinTextSpotter, Transformer encoder, Recognition Conversion mechanism, Feature interaction

AI-generated Key Points

  • End-to-end scene text spotting is gaining attention for combining scene text detection and recognition
  • Current methods merge detection and recognition by sharing a backbone, limiting feature interaction
  • SwinTextSpotter introduces a novel framework using a transformer encoder for detection and Recognition Conversion mechanism for unifying localization and recognition
  • SwinTextSpotter eliminates the need for rectification modules or character-level annotations for arbitrarily-shaped text
  • Demonstrated effectiveness through experiments on various datasets, outperforming existing methods significantly
  • Ablation studies show the impact of key components like Recognition Conversion in improving performance
  • SwinTextSpotter enhances feature interaction between detection and recognition tasks, presenting a promising approach to scene text spotting
  • The framework's simplicity and effectiveness make it valuable for researchers and practitioners in the field

Authors: Mingxin Huang, Yuliang Liu, Zhenghao Peng, Chongyu Liu, Dahua Lin, Shenggao Zhu, Nicholas Yuan, Kai Ding, Lianwen Jin

Accepted to appear in CVPR 2022
License: CC BY-NC-SA 4.0

Abstract: End-to-end scene text spotting has attracted great attention in recent years due to the success of excavating the intrinsic synergy of the scene text detection and recognition. However, recent state-of-the-art methods usually incorporate detection and recognition simply by sharing the backbone, which does not directly take advantage of the feature interaction between the two tasks. In this paper, we propose a new end-to-end scene text spotting framework termed SwinTextSpotter. Using a transformer encoder with dynamic head as the detector, we unify the two tasks with a novel Recognition Conversion mechanism to explicitly guide text localization through recognition loss. The straightforward design results in a concise framework that requires neither additional rectification module nor character-level annotation for the arbitrarily-shaped text. Qualitative and quantitative experiments on multi-oriented datasets RoIC13 and ICDAR 2015, arbitrarily-shaped datasets Total-Text and CTW1500, and multi-lingual datasets ReCTS (Chinese) and VinText (Vietnamese) demonstrate SwinTextSpotter significantly outperforms existing methods. Code is available at https://github.com/mxin262/SwinTextSpotter.

Submitted to arXiv on 19 Mar. 2022

Results of the summarizing process for the arXiv paper: 2203.10209v1

End-to-end scene text spotting has gained attention in recent years because it exploits the synergy between scene text detection and recognition. Current state-of-the-art methods, however, typically couple the two tasks only by sharing a backbone, which does not directly exploit the feature interaction between them. To address this limitation, the paper introduces SwinTextSpotter, a new end-to-end scene text spotting framework. It uses a transformer encoder with a dynamic head as the detector and unifies the two tasks through a Recognition Conversion mechanism, which explicitly guides text localization through the recognition loss. This design requires neither an additional rectification module nor character-level annotations for arbitrarily-shaped text.

Qualitative and quantitative experiments cover the multi-oriented datasets RoIC13 and ICDAR 2015, the arbitrarily-shaped datasets Total-Text and CTW1500, and the multi-lingual datasets ReCTS (Chinese) and VinText (Vietnamese). The authors report that SwinTextSpotter significantly outperforms existing methods on these benchmarks. Ablation studies on Total-Text, with ResNet-50 as the baseline backbone, indicate that key components such as Recognition Conversion improve both detection and end-to-end spotting performance.

Overall, SwinTextSpotter improves scene text spotting by strengthening the feature interaction between detection and recognition, and its simple design makes it useful to both researchers and practitioners. Further study of its limitations could lead to more robust performance in real-world applications.
Created on 27 Jun. 2024
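
The summary says the recognition loss explicitly guides text localization, but it does not describe how. The toy sketch below illustrates one way such coupling can work, under loudly stated assumptions: the detector's soft mask gates the features passed to the recognizer, so the gradient of the recognition loss reaches the detector's mask logits. It is not the authors' implementation. The function names, the sigmoid gating, and the squared-error stand-in for a recognition loss are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_features(mask_logits, features):
    """Scale each feature by the detector's soft mask value in (0, 1).

    Assumption: gating by a mask is only an illustrative coupling,
    not the Recognition Conversion module from the paper.
    """
    return [sigmoid(m) * f for m, f in zip(mask_logits, features)]


def recognition_loss(mask_logits, features, target):
    """Toy squared-error stand-in for a recognition loss on gated features."""
    gated = gated_features(mask_logits, features)
    return sum((g - t) ** 2 for g, t in zip(gated, target))


def grad_wrt_mask_logits(mask_logits, features, target):
    """Analytic gradient of the toy loss with respect to the mask logits."""
    grads = []
    for m, f, t in zip(mask_logits, features, target):
        s = sigmoid(m)
        # Chain rule: d/dm (s*f - t)^2 = 2 * (s*f - t) * f * s * (1 - s)
        grads.append(2.0 * (s * f - t) * f * s * (1.0 - s))
    return grads


# Four positions: the first two contain text, the last two are background.
features = [1.0, 1.0, 1.0, 1.0]
target = [1.0, 1.0, 0.0, 0.0]
mask_logits = [0.0, 0.0, 0.0, 0.0]  # detector starts out undecided

grads = grad_wrt_mask_logits(mask_logits, features, target)
loss_before = recognition_loss(mask_logits, features, target)

# One gradient-descent step on the detector's mask logits only.
learning_rate = 1.0
updated_logits = [m - learning_rate * g for m, g in zip(mask_logits, grads)]
loss_after = recognition_loss(updated_logits, features, target)
```

In this toy setup the step raises the mask logits at the text positions and lowers them at the background positions, so the recognition objective alone sharpens the localization signal. A backbone that is merely shared would not give the recognition loss this direct path into the detector's mask.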
