Video-Audio Domain Generalization via Confounder Disentanglement

Shengyu Zhang; Xusheng Feng; Wenyan Fan; Wenjing Fang; Fuli Feng; Wei Ji; Shuo Li; Li Wang; Shanshan Zhao; Zhou Zhao; Tat-Seng Chua; Fei Wu

AAAI 2023

Abstract

Existing video-audio understanding models are trained and evaluated in an intra-domain setting, facing performance degeneration in real-world applications where multiple domains and distribution shifts naturally exist. The key to video-audio domain generalization (VADG) lies in alleviating spurious correlations over multi-modal features. To achieve this goal, we resort to causal theory and attribute such correlation to confounders affecting both video-audio features and labels. We propose a DeVADG framework that conducts uni-modal and cross-modal deconfounding through back-door adjustment. DeVADG performs cross-modal disentanglement and obtains fine-grained confounders at both class-level and domain-level using half-sibling regression and unpaired domain transformation, which essentially identifies domain-variant factors and class-shared factors that cause spurious correlations between features and false labels. To promote VADG research, we collect a VADG-Action dataset for video-audio action recognition with over 5,000 video clips across four domains (e.g., cartoon and game) and ten action classes (e.g., cooking and riding). We conduct extensive experiments, i.e., multi-source DG, single-source DG, and qualitative analysis, validating the rationality of our causal analysis and the effectiveness of the DeVADG framework.
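
For reference, the back-door adjustment that the abstract invokes can be written in its generic textbook form as below. This is only a sketch under stated assumptions: X stands for the video-audio features, Y for the label, and Z for a set of confounders satisfying the back-door criterion. These symbols are illustrative and not taken from the paper; how DeVADG defines and estimates its class-level and domain-level confounders is not stated on this page.

```latex
% Generic back-door adjustment (textbook form; X, Y, Z are assumed symbols)
P\bigl(Y \mid \mathrm{do}(X)\bigr) \;=\; \sum_{z} P\bigl(Y \mid X,\, Z = z\bigr)\, P\bigl(Z = z\bigr)
```

Read this way, deconfounding replaces the observational prediction P(Y | X) with an average over confounder values weighted by their prior P(Z = z), so that the prediction no longer depends on how Z happens to co-occur with X in the training domains.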

Topics

Artificial Intelligence > Core AI > Causal Inference; Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Application Areas > Domain Generalization; Computer Vision > Analysis > Action Recognition; Machine Learning > Learning Types > Domain Generalization; Deep Learning > Learning Types > Multi-Modal Learning

Keywords

causal inference; action recognition; domain generalization; multimodal learning; cross-modal learning; multi-modal learning; video understanding; backdoor adjustment; confounder disentanglement
