Alignment and Distillation: A Robust Framework for Multimodal Domain Generalizable Human Action Recognition
Abstract
Human Action Recognition (HAR) in real-world scenarios is significantly challenged by unseen domain shifts, such as variations in camera viewpoint, illumination, or background. Recent advances in video domain generalization have produced models that are more robust to such shifts, but existing methods still fall short. They typically depend on a single modality or employ static frame-level fusion, which limits their ability to capture multi-scale temporal dependencies and to align the asynchronous modalities frequently present in video data. To address these limitations, we propose Multimodal Alignment and Distillation for Domain Generalization (MAD-DG), a novel framework that synchronizes asynchronous modalities through a segment-label aligned temporal binding window combined with a contrastive alignment mechanism. We further incorporate an online self-distillation temporal module that captures multi-scale temporal relationships and learns robust, domain-invariant representations. Extensive experiments demonstrate that MAD-DG achieves state-of-the-art performance and generalizes more effectively in both single-source and multi-source domain generalization settings. Our source code is available at https://github.com/dxlabskku/MAD-DG.
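To make the idea of contrastive alignment between modalities concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over paired segment embeddings. It is an illustration under stated assumptions, not the MAD-DG implementation: the function names, the choice of InfoNCE, the temperature value, and the assumption that row i of one modality pairs with row i of the other are all hypothetical and are not taken from the paper or its repository.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss between two modalities (hypothetical helper).

    emb_a[i] and emb_b[i] are assumed to be embeddings of the same temporal
    segment from two different modalities; all other pairs act as negatives.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity matrix scaled by the temperature.
    logits = [[sum(p * q for p, q in zip(va, vb)) / temperature for vb in b]
              for va in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy example: two segments, two modalities.
segments_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9]]
misaligned_b = list(reversed(aligned_b))

loss_aligned = symmetric_info_nce(segments_a, aligned_b)
loss_misaligned = symmetric_info_nce(segments_a, misaligned_b)
```

In this toy setting the loss is lower when matching segments have similar embeddings across modalities than when the pairing is swapped, which is the behavior a contrastive alignment objective is meant to encourage.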