LayoutFormer: Hierarchical Text Detection Towards Scene Text Understanding

Min Liang; Jia-Wei Ma; Xiaobin Zhu; Jingyan Qin; Xu-Cheng Yin

CVPR 2024

Abstract

Existing scene text detectors generally focus on accurately detecting single-level (i.e., word-level, line-level, or paragraph-level) text entities without exploring the relationships among different levels of text entities. To comprehensively understand scene text, it is critical to detect multi-level text while exploring its contextual information. To this end, we propose a unified framework (dubbed LayoutFormer) for hierarchical text detection, which simultaneously conducts multi-level text detection and predicts geometric layouts to promote scene text understanding. In LayoutFormer, WordDecoder, LineDecoder, and ParaDecoder are responsible for word-level, line-level, and paragraph-level text prediction, respectively. Meanwhile, WordDecoder and ParaDecoder adaptively learn word-line and line-paragraph relationships, respectively. In addition, we propose a Prior Location Sampler that operates on multi-scale features to adaptively select a few representative foreground features for updating text queries. It improves hierarchical detection performance while significantly reducing the computational cost. Comprehensive experiments verify that our method achieves state-of-the-art performance on single-level and hierarchical text detection.
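The abstract does not say how the Prior Location Sampler picks its features, so the following is only an illustrative sketch under a stated assumption: each location on each feature scale carries a foreground (text) score, and the sampler keeps the k highest-scoring locations across all scales. The function name `sample_foreground_features`, the score inputs, and the top-k rule are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Feature = Sequence[float]


def sample_foreground_features(
    levels: List[List[Feature]],
    scores: List[List[float]],
    k: int,
) -> List[Tuple[int, int, Feature]]:
    """Keep the k highest-scoring locations across all feature levels.

    levels[i][j] is the feature vector at flattened location j of scale i.
    scores[i][j] is an assumed foreground (text) score for that location.
    Returns (level index, location index, feature) triples, best first.
    """
    candidates = [
        (score, lvl, loc)
        for lvl, level_scores in enumerate(scores)
        for loc, score in enumerate(level_scores)
    ]
    # Descending score; ties fall back to level, then location, for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    return [(lvl, loc, levels[lvl][loc]) for _, lvl, loc in candidates[:k]]


# Toy input: a fine scale with 4 locations and a coarse scale with 2.
levels = [
    [[0.1, 0.2], [0.9, 0.8], [0.3, 0.3], [0.5, 0.1]],
    [[0.7, 0.6], [0.2, 0.4]],
]
scores = [
    [0.05, 0.92, 0.10, 0.40],
    [0.81, 0.03],
]
selected = sample_foreground_features(levels, scores, k=2)
```

Under this assumption, only the k selected features would be passed on to update the text queries, rather than every location of every scale, which is one way the computational saving mentioned in the abstract could arise.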

Topics

Deep Learning > Architectures > Transformers; Computer Vision > Analysis > Object Detection; Computer Vision > Analysis > Scene Understanding; Natural Language Processing > Applications > Information Extraction; Artificial Intelligence > Core AI > Computer Vision; Artificial Intelligence > Core AI > Multi-Modal Learning; Computer Vision > Analysis > Computer Vision

Keywords

transformer architecture; scene text detection; transformer decoder; text detection; neural decoder; text recognition; scene text; hierarchical detection; hierarchical text detection; scene text understanding; geometric layout
