Multi-modal Deepfake Detection via Multi-task Audio-Visual Prompt Learning

Hui Miao; Yuanfang Guo; Zeming Liu; Yunhong Wang

AAAI 2025

Abstract

With the malicious use and dissemination of multi-modal deepfake videos, researchers have begun to investigate multi-modal deepfake detection. Unfortunately, most existing methods tune all the parameters of the deep network on limited speech video datasets and are trained under coarse-grained consistency supervision, which hinders their generalization ability in practical scenarios. To solve these problems, we propose the first multi-task audio-visual prompt learning method for multi-modal deepfake video detection, which exploits multiple foundation models. Specifically, we construct a two-stream multi-task learning architecture and propose sequential visual prompts and short-time audio prompts to extract multi-modal features, which are aligned at the frame level and used in subsequent fine-grained feature matching and fusion. Because visual content and the audio signal are naturally aligned in real data, we propose a frame-level cross-modal feature matching loss function to learn fine-grained audio-visual consistency. Comprehensive experiments demonstrate the effectiveness and superior generalization ability of our method compared with state-of-the-art methods.
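The abstract does not give the form of the frame-level cross-modal feature matching loss, so the following is only a minimal sketch of one plausible reading: for a real video, the visual and audio features of the same frame are pulled together by penalizing low cosine similarity. The function names, the cosine-distance form, and the choice to apply no constraint to fake videos are all assumptions made for illustration, not the authors' formulation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as carrying no similarity signal
    return dot / (norm_u * norm_v)


def frame_matching_loss(visual_frames, audio_frames, is_real):
    """Hypothetical frame-level audio-visual matching loss.

    visual_frames, audio_frames: lists of per-frame feature vectors,
    assumed to be already aligned so that index t refers to the same
    moment in both streams.
    is_real: True when the video is genuine.

    Assumption: only real videos are constrained, because the abstract
    motivates the loss by the natural alignment found in real data.
    """
    if len(visual_frames) != len(audio_frames):
        raise ValueError("visual and audio streams must have the same number of frames")
    if not is_real or not visual_frames:
        return 0.0
    # Mean cosine distance over frames: 0 when every frame pair points
    # in the same direction, larger when the modalities disagree.
    per_frame = [
        1.0 - cosine_similarity(v, a)
        for v, a in zip(visual_frames, audio_frames)
    ]
    return sum(per_frame) / len(per_frame)
```

In this sketch the loss is computed per frame rather than per video, which is what distinguishes fine-grained supervision from a single clip-level consistency score.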

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Learning Paradigms > Few-Shot Learning
Computer Vision > Analysis > Anomaly Detection
Computer Vision > Core AI > Computer Vision
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Techniques > Fine-Tuning
Computer Vision > Analysis > Computer Vision

Keywords

multi-task learning, multimodal learning, audio-visual learning, deepfake detection, prompt learning, foundation model, fine-grained matching, cross-modal feature, multi-modal deepfake detection, audio-visual prompt learning, fine-grained feature matching, fine-grained consistency
