Vision-Language Pre-Training for Boosting Scene Text Detectors

Sibo Song; Jianqiang Wan; Zhibo Yang; Jun Tang; Wenqing Cheng; Xiang Bai; Cong Yao

CVPR 2022

Abstract

Recently, vision-language joint representation learning has proven to be highly effective in various scenarios. In this paper, we specifically adapt vision-language joint learning for scene text detection, a task that intrinsically involves cross-modal interaction between the two modalities: vision and language, since text is the written form of language. Concretely, we propose to learn contextualized, joint representations through vision-language pre-training, for the sake of enhancing the performance of scene text detectors. Towards this end, we devise a pre-training architecture with an image encoder, a text encoder and a cross-modal encoder, as well as three pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM) and word-in-image prediction (WIP). The pre-trained model is able to produce more informative representations with richer semantics, which could readily benefit existing scene text detectors (such as EAST and PSENet) in the down-stream text detection task. Extensive experiments on standard benchmarks demonstrate that the proposed paradigm can significantly improve the performance of various representative text detectors, outperforming previous pre-training approaches. The code and pre-trained models will be publicly released.
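To make the image-text contrastive (ITC) pretext task concrete, here is a minimal sketch of a symmetric contrastive loss over a batch of paired image and text embeddings. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the InfoNCE-style form, the temperature value of 0.07, and the helper names `l2_normalize`, `log_softmax` and `itc_loss`. It is not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum_exp for x in row]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row k of each input is a matched pair.

    The temperature is a hypothetical default, not a value from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # Cosine similarity between every image and every text, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each image should pick out its own text.
    image_to_text = -sum(log_softmax(sim[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    text_to_image = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n

    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    # Toy embeddings: pair 0 and pair 1 are aligned by construction.
    image_batch = [[1.0, 0.0], [0.0, 1.0]]
    text_batch = [[1.0, 0.0], [0.0, 1.0]]
    print(itc_loss(image_batch, text_batch))
```

The loss is small when each image embedding is closest to its own text embedding and grows when the pairs are mismatched, which is the behavior a contrastive objective is meant to encourage in the two encoders.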

Topics

Machine Learning > Application Areas > Domain Adaptation
Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Pretraining
Computer Science > Applications > Document Analysis
Natural Language Processing > Resources & Methods > Transfer Learning
Deep Learning > Learning Types > Multi-Modal Learning
Computer Vision > Applications > Computer Vision
Computer Vision > Applications > Document Analysis

Keywords

multimodal learning; cross-modal learning; scene text detection; vision-language model; vision-language pre-training; masked language modeling; cross-modal interaction; text recognition; image-text contrastive learning; cross-modal encoder; word-in-image prediction
