AuViRe: Audio-visual Speech Representation Reconstruction for Deepfake Temporal Localization

Christos Koutlis; Symeon Papadopoulos

WACV 2026

Abstract

With the rapid advancement of sophisticated synthetic audio-visual content, e.g., for subtle malicious manipulations, ensuring the perceptual integrity of digital media has become paramount. This work presents a novel approach to the temporal localization of deepfakes that leverages Audio-Visual Speech Representation Reconstruction (AuViRe). Specifically, our approach reconstructs speech representations of one modality (e.g., visual lip movements) from the other (e.g., the audio waveform). In manipulated regions, this cross-modal reconstruction becomes significantly more challenging, which amplifies discrepancies and thereby provides robust discriminative cues for precise forgery localization. AuViRe outperforms the state of the art by +8.9 AP@0.95 on LAV-DF, +9.6 AP@0.5 on AV-Deepfake1M, and +5.1 AUC in an in-the-wild experiment. Code will be made publicly available upon acceptance.
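The cross-modal reconstruction idea can be illustrated with a toy sketch. This is not the AuViRe architecture: it assumes a ridge-regularized linear map as a stand-in for the reconstruction model, synthetic random vectors in place of real audio and visual speech representations, and a hand-picked error threshold. All function names, dimensions, and the threshold are hypothetical. It only shows the principle that frames where one modality cannot be predicted from the other stand out through a larger reconstruction error.

```python
import numpy as np

def fit_linear_reconstructor(src, tgt, ridge=1e-3):
    """Fit W so that src @ W approximates tgt (ridge-regularized least squares).
    Assumption: a linear map replaces the learned reconstruction network."""
    d = src.shape[1]
    return np.linalg.solve(src.T @ src + ridge * np.eye(d), src.T @ tgt)

def reconstruction_error(src, tgt, W):
    """Per-frame L2 distance between reconstructed and actual target features."""
    return np.linalg.norm(src @ W - tgt, axis=1)

def localize(errors, threshold):
    """Return (start, end) frame index pairs, end exclusive, where error > threshold."""
    segments, start = [], None
    for i, flagged in enumerate(errors > threshold):
        if flagged and start is None:
            start = i                      # a suspicious segment begins
        elif not flagged and start is not None:
            segments.append((start, i))    # the segment ended at frame i - 1
            start = None
    if start is not None:                  # segment runs to the last frame
        segments.append((start, len(errors)))
    return segments

# Synthetic stand-ins for per-frame speech representations (hypothetical sizes).
rng = np.random.default_rng(0)
T, d_audio, d_visual = 200, 8, 6
M = rng.normal(size=(d_audio, d_visual))   # hidden audio-to-visual relation

# "Genuine" training data: visual features are consistent with audio features.
audio_train = rng.normal(size=(500, d_audio))
visual_train = audio_train @ M + 0.01 * rng.normal(size=(500, d_visual))
W = fit_linear_reconstructor(audio_train, visual_train)

# Test clip: genuine except frames 80..119, where the visual stream is replaced.
audio = rng.normal(size=(T, d_audio))
visual = audio @ M + 0.01 * rng.normal(size=(T, d_visual))
visual[80:120] = rng.normal(size=(40, d_visual))

errors = reconstruction_error(audio, visual, W)
segments = localize(errors, threshold=0.5)  # threshold chosen by hand for this toy
print(segments)
```

In this sketch the reconstruction runs from audio to visual only; the abstract describes reconstruction from one modality based on the other, so the opposite direction could be added in the same way and the two error signals combined.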
