Bridging the Domain Gap in Small Multimodal Models: A Dual-level Alignment Perspective
Abstract
Small Multimodal Models (SMMs) degrade under distribution shift after fine-tuning. Unsupervised Domain Adaptation (UDA) is a common remedy, but existing theory and methods are designed primarily for single- or dual-encoder architectures and overlook the encoder-decoder structure of SMMs, whose fusion mechanism introduces additional shift. This work bridges this gap in two steps. First, we derive a dual-divergence risk bound that separates encoder divergence from fusion divergence, and we use a negation-flip example to illustrate its tightness relative to the classical encoder-only bound. Second, motivated by this theory, we propose Dual-level Adversarial Alignment (DuAA), a two-stage alignment algorithm. DuAA inserts domain-discriminative adapters after the encoder and within the decoder to minimize both divergences, and it employs selective pseudo-labeling to refine target semantics. Our contribution targets domain shift in encoder-decoder SMMs and is agnostic to the fine-tuning mechanism: DuAA acts on internal representations, which makes it orthogonal to LoRA and other fine-tuning variants. We adopt LoRA in experiments solely as a popular, parameter-efficient instantiation that keeps the protocol fixed across settings. From existing datasets, we compile twelve new cross-domain visual question answering (VQA) tasks with distinct visual and textual shifts, and we observe that DuAA consistently outperforms standard fine-tuning across all tasks.
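To make the selective pseudo-labeling step concrete, the following is a minimal sketch under stated assumptions, not the selection rule used by DuAA, which the abstract does not specify. It assumes a simple confidence threshold on softmax probabilities, a common choice in UDA; the function name `select_pseudo_labels` and the default threshold of 0.9 are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_pseudo_labels(target_logits, threshold=0.9):
    """Keep only confidently predicted unlabeled target samples.

    ASSUMPTION: confidence thresholding stands in for the paper's
    selection criterion, which the abstract does not describe.

    Returns a list of (sample_index, pseudo_label, confidence) tuples.
    """
    selected = []
    for i, logits in enumerate(target_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            selected.append((i, probs.index(confidence), confidence))
    return selected


# Hypothetical answer logits for three unlabeled target-domain questions.
example_logits = [
    [5.0, 0.0, 0.0],  # confident in answer 0
    [0.1, 0.0, 0.2],  # near-uniform, so it is discarded
    [0.0, 6.0, 0.0],  # confident in answer 1
]
kept = select_pseudo_labels(example_logits)
```

In this sketch, only the first and third samples pass the threshold and would be added to the training set with their predicted answers as labels; the ambiguous second sample is left out so that noisy labels do not distort the target semantics.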