Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment

Zheren Fu; Lei Zhang; Hou Xia; Zhendong Mao

2024 CVPR CVPR 2024

Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment

Abstract

Cross-modal alignment aims to build a bridge connecting vision and language. It is an important multi-modal task that efficiently learns the semantic similarities between images and texts. Traditional fine-grained alignment methods heavily rely on pre-trained object detectors to extract region features for subsequent region-word alignment thereby incurring substantial computational costs for region detection and error propagation issues for two-stage training. In this paper we focus on the mainstream vision transformer incorporating patch features for patch-word alignment while addressing the resultant issue of visual patch redundancy and patch ambiguity for semantic alignment. We propose a novel Linguistic-Aware Patch Slimming (LAPS) framework for fine-grained alignment which explicitly identifies redundant visual patches with language supervision and rectifies their semantic and spatial information to facilitate more effective and consistent patch-word alignment. Extensive experiments on various evaluation benchmarks and model backbones show LAPS outperforms the state-of-the-art fine-grained alignment methods by 5%-15% rSum. Our code is available at https://github.com/CrossmodalGroup/LAPS

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — patch-word alignment

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zheren Fu , Lei Zhang , Hou Xia , Zhendong Mao

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Self-Supervised Learning Machine Learning > Application Areas > Domain Adaptation Deep Learning > Architectures > Transformers Machine Learning > Learning Types > Representation Learning Machine Learning > Core Methods > Feature Learning Computer Vision > Core AI > Multimodal Learning Natural Language Processing > Resources & Methods > Multimodal NLP Deep Learning > Learning Types > Multi-Modal Learning

Keywords

vision transformer semantic alignment semantic similarity cross-modal alignment language supervision patch feature fine-grained alignment vision language visual transformer patch-word alignment visual patch redundancy linguistic-aware slimming

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024