HarmoNet: Partial DeepFake Detection Network based on Multi-scale HarmoF0 Feature Fusion

Liwei Liu; Huihui Wei; Dongya Liu; Zhonghua Fu

INTERSPEECH 2024

Abstract

Audio DeepFake detection (ADD) has recently become increasingly challenging with the rise of spoofing attacks that use artificially generated audio. Track 2 of ADD 2023 requires not only detecting DeepFake audio but also locating the manipulated regions. To tackle this challenge, we propose HarmoNet, a framework that leverages multi-scale harmonic F0 and Wav2Vec features with an attention mechanism. This allows the model to capture changes in each region of the utterance. We also introduce a new loss function, Partial Loss, which focuses more on the boundaries between real and fake regions, and we design a post-processor to refine the model's output. Our framework achieved a score of 70.61% in Track 2 of ADD 2023, a 67.12% improvement over the baseline, and achieved the best performance. HarmoNet also shows competitive performance on other DeepFake datasets.

Topics

Deep Learning > Techniques > Model Architecture; Computer Vision > Analysis > Anomaly Detection

Keywords

attention mechanism; deepfake detection; audio deepfake; partial loss; wav2vec feature; harmonic f0
