Cross-Domain Multi-Modal Few-Shot Object Detection via Rich Text

Zeyu Shangguan; Daniel Seita; Mohammad Rostami

WACV 2025

Abstract

Cross-modal feature extraction and integration have led to steady performance improvements in few-shot learning tasks. However, existing multi-modal object detection (MM-OD) methods degrade under significant domain shift and when training samples are insufficient. We hypothesize that rich text information can help the model build a knowledge relationship between a visual instance and its language description more effectively, and can thereby help mitigate domain shift. Specifically, we study the cross-domain few-shot generalization of MM-OD (CDMM-FSOD) and propose a meta-learning-based multi-modal few-shot object detection method that uses rich text semantic information as an auxiliary modality to achieve domain adaptation. Our proposed network contains two components: a multi-modal feature aggregation module that aligns the vision and language support feature embeddings, and a rich text semantic rectify module that uses bidirectional text feature generation to reinforce multi-modal feature alignment and thus enhance the model's language understanding capability. We evaluate our model on standard cross-domain object detection datasets and demonstrate that our approach considerably outperforms existing FSOD methods. Our implementation is publicly available: https://github.com/zshanggu/CDMM
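
To make the idea of combining vision and language support embeddings concrete, the following is a minimal sketch in plain Python. It is not the authors' architecture: the function names, the `text_weight` mixing coefficient, and the toy three-dimensional vectors are all hypothetical, and a simple weighted average stands in for the learned aggregation module described above. It only illustrates how a text embedding could act as an auxiliary signal when building a class prototype from a few visual examples.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fuse_support(vision_feats, text_feat, text_weight=0.5):
    """Build one class prototype from K visual support embeddings and one
    text embedding. `text_weight` is a hypothetical mixing coefficient,
    not a parameter taken from the paper."""
    vision_proto = l2_normalize(mean_vector([l2_normalize(v) for v in vision_feats]))
    text_proto = l2_normalize(text_feat)
    fused = [(1 - text_weight) * a + text_weight * b
             for a, b in zip(vision_proto, text_proto)]
    return l2_normalize(fused)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def classify(query_feat, prototypes):
    """Return the class whose fused prototype is most similar to the query."""
    return max(prototypes, key=lambda name: cosine(query_feat, prototypes[name]))


# Toy 2-shot example with made-up 3-dimensional embeddings.
prototypes = {
    "cat": fuse_support([[1.0, 0.1, 0.0], [0.9, 0.2, 0.0]], [1.0, 0.0, 0.1]),
    "dog": fuse_support([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]], [0.0, 1.0, 0.0]),
}
print(classify([0.8, 0.3, 0.0], prototypes))
```

In a real detector the embeddings would come from learned vision and language encoders and the fusion would be trained end to end; the fixed average here is only a stand-in for that step.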

Topics

Artificial Intelligence > Learning Paradigms > Few-Shot Learning
Machine Learning > Application Areas > Domain Adaptation
Computer Vision > Analysis > Object Detection
Machine Learning > Learning Paradigms > Few-Shot Learning
Machine Learning > Learning Types > Multi-Modal Learning

Keywords

few-shot learning, domain adaptation, object detection, multi-modal learning, cross-domain generalization, few-shot object detection, multi-modal object detection, rich text
