Can Audio-Visual Integration Strengthen Robustness Under Multimodal Attacks?

Yapeng Tian; Chenliang Xu

CVPR 2021

Abstract

In this paper, we present a systematic study of machines' multisensory perception under attacks. We use the audio-visual event recognition task against multimodal adversarial attacks as a proxy to investigate the robustness of audio-visual learning. We attack the audio modality, the visual modality, and both together to explore whether audio-visual integration still strengthens perception and how different fusion mechanisms affect the robustness of audio-visual models. To interpret the multimodal interactions under attacks, we learn a weakly-supervised sound source visual localization model to localize sounding regions in videos. To mitigate multimodal attacks, we propose an audio-visual defense approach based on an audio-visual dissimilarity constraint and external feature memory banks. Extensive experiments demonstrate that audio-visual models are susceptible to multimodal adversarial attacks; that audio-visual integration can decrease model robustness rather than strengthen it under multimodal attacks; that even a weakly-supervised sound source visual localization model can be successfully fooled; and that our defense method can improve the robustness of audio-visual networks without significantly sacrificing clean model performance.
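To make the attack setting concrete, here is a minimal sketch of perturbing the audio input, the visual input, or both. It is not the paper's networks or attack configuration: the toy linear late-fusion classifier, the single-step sign-gradient (FGSM-style) update, and every name in the code (`predict`, `sign_gradient_attack`, `mode`, `eps`) are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(x):
    # Returns -1, 0, or 1.
    return (x > 0) - (x < 0)


def predict(audio, visual, w_a, w_v, b):
    # Toy late fusion (an assumption): audio and visual scores are summed
    # into a single logit for a binary event label.
    z = b
    z += sum(w * x for w, x in zip(w_a, audio))
    z += sum(w * x for w, x in zip(w_v, visual))
    return sigmoid(z)


def sign_gradient_attack(audio, visual, label, w_a, w_v, b, eps, mode="both"):
    if mode not in ("audio", "visual", "both"):
        raise ValueError("mode must be 'audio', 'visual', or 'both'")
    # For binary cross-entropy, dLoss/dlogit = p - label, so the gradient
    # with respect to each input feature is (p - label) * weight.
    err = predict(audio, visual, w_a, w_v, b) - label
    adv_audio, adv_visual = list(audio), list(visual)
    if mode in ("audio", "both"):
        adv_audio = [x + eps * sign(err * w) for x, w in zip(audio, w_a)]
    if mode in ("visual", "both"):
        adv_visual = [x + eps * sign(err * w) for x, w in zip(visual, w_v)]
    return adv_audio, adv_visual


# Hypothetical toy weights and inputs, for illustration only.
w_a, w_v, b = [1.0, -2.0], [0.5, 1.5], 0.0
audio, visual, label = [1.0, -0.5], [1.0, 1.0], 1

for mode in ("audio", "visual", "both"):
    adv_a, adv_v = sign_gradient_attack(audio, visual, label, w_a, w_v, b, 0.5, mode)
    print(mode, predict(adv_a, adv_v, w_a, w_v, b))
```

Setting `mode` to a single modality leaves the other input untouched, which mirrors the audio-only, visual-only, and joint attack comparison described in the abstract.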

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Weakly Supervised Learning
Machine Learning > Application Areas > Privacy
Artificial Intelligence > Core AI > Adversarial Learning
Deep Learning > Learning Types > Multi-Modal Learning
Machine Learning > Learning Types > Robustness

Keywords

adversarial robustness, model robustness, multimodal learning, audio-visual learning, sound localization, weakly-supervised learning, adversarial attack, audio-visual recognition, multimodal defense
