2024 WACV WACV 2024

Textual Alchemy: CoFormer for Scene Text Understanding

Abstract

The paper presents CoFormer (Convolutional Fourier Transformer), a robust and adaptable transformer architecture designed for a range of scene text tasks. CoFormer integrates convolution and Fourier operations into the transformer architecture. Thus, it leverages convolution properties such as shared weights, local receptive fields, and spatial subsampling, while the Fourier operation emphasizes composite characteristics from the frequency domain. The research further proposes the first pretraining datasets, named Textverse10M-E and Textverse10M-H. Using these datasets, we demonstrate the efficacy of pretraining for scene text understanding. CoFormer achieves state-of-theart results with and without pretraining on two downstream tasks: scene text recognition and scene text style transfer. The paper presents LISTNet (Language Invariant Style Transfer), a novel framework for bi-lingual scene text style transfer. It also introduces three datasets, viz., TST500K for scene text style transfer, CSTR2.5M and Akshara550 for scene text recognition.

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio