Bridging the Domain Gap in Small Multimodal Models: A Dual-level Alignment Perspective

Aveen Dayal; Peketi Divya; Nidhi Tiwari; Linga Reddy Cenkeramaddi; C Krishna Mohan; Abhinav Kumar

WACV 2026

Abstract

Small Multimodal Models (SMMs) suffer under distribution shift after fine-tuning. Unsupervised Domain Adaptation (UDA) is a common remedy, but existing theory and methods are designed primarily for single- or dual-encoder architectures. They overlook the encoder-decoder structure of SMMs, whose fusion mechanism introduces additional shift. This work bridges this gap in two steps. First, we derive a dual-divergence risk bound that separates encoder divergence from fusion divergence, and we use a negation-flip example to illustrate its tightness relative to the classical encoder-only bound. Second, motivated by this theory, we propose Dual-level Adversarial Alignment (DuAA), a two-stage alignment algorithm. DuAA inserts domain-discriminative adapters after the encoder and within the decoder to minimize both divergences, and it employs selective pseudo-labeling to refine target semantics. Our contribution targets domain shift in encoder-decoder SMMs and is agnostic to the fine-tuning mechanism: DuAA acts on internal representations, making it orthogonal to LoRA and other fine-tuning variants. We adopt LoRA in experiments solely as a popular, parameter-efficient instantiation that keeps the protocol fixed across settings. We compile twelve new cross-domain VQA tasks with distinct visual and textual shifts from existing datasets and observe that DuAA consistently outperforms standard fine-tuning across all tasks.
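
The abstract does not state how pseudo-labels are selected, so the following is only a minimal sketch of one common form of selective pseudo-labeling: keep a target prediction only when its softmax confidence clears a threshold. The function names, the threshold value of 0.9, and the toy logits are all illustrative assumptions and are not taken from DuAA.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pseudo_labels(target_logits, threshold=0.9):
    """Return (sample_index, predicted_class, confidence) for confident samples.

    Assumption: confidence is the maximum softmax probability. The paper's
    actual selection rule may differ.
    """
    selected = []
    for idx, logits in enumerate(target_logits):
        probs = softmax(logits)
        conf = max(probs)
        if conf >= threshold:
            selected.append((idx, probs.index(conf), conf))
    return selected


# Toy unlabeled target-domain logits (hypothetical values).
target_logits = [
    [4.0, 0.0, 0.0],  # confident in class 0
    [0.5, 0.4, 0.3],  # ambiguous, should be skipped
    [0.0, 0.0, 5.0],  # confident in class 2
]
pseudo = select_pseudo_labels(target_logits, threshold=0.9)
for idx, label, conf in pseudo:
    print(f"sample {idx}: pseudo-label {label} (confidence {conf:.3f})")
```

In this sketch only the first and third samples would be passed on as pseudo-labeled target data. Raising the threshold trades coverage for label quality.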
