RawBMamba: End-to-End Bidirectional State Space Model for Audio Deepfake Detection

Yujie Chen; Jiangyan Yi; Jun Xue; Chenglong Wang; XiaoHui Zhang; Shunbo Dong; Siding Zeng; Jianhua Tao; Zhao Lv; Cunhang Fan

INTERSPEECH 2024

Abstract

Artefacts that distinguish bonafide from fake audio can exist in both short- and long-range segments, so combining local and global feature information can help discriminate between the two. This paper proposes an end-to-end bidirectional state space model, named RawBMamba, to capture both short- and long-range discriminative information for audio deepfake detection. Specifically, we use a sinc layer and multiple convolutional layers to capture short-range features, and then design a bidirectional Mamba to address Mamba's unidirectional modelling problem and further capture long-range feature information. Moreover, we develop a bidirectional fusion module to integrate the embeddings, enhancing audio context representation and combining short- and long-range information. The results show that our proposed RawBMamba achieves a 34.1% improvement over Rawformer on the ASVspoof2021 LA dataset and demonstrates competitive performance on other datasets. Code will be released at https://github.com/cyjie429/RawBMamba.
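To make the idea of bidirectional modelling concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a scalar linear state space recurrence with fixed parameters `a`, `b`, `c` (Mamba itself uses input-dependent, selective parameters), and it assumes a simple weighted sum as the fusion step, since the abstract does not specify how the fusion module combines the two directions. All function names here are hypothetical.

```python
from typing import List, Tuple


def ssm_scan(x: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Causal scan of a toy scalar state space model:
    h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
    Each output depends only on the current and earlier inputs."""
    h = 0.0
    out = []
    for x_t in x:
        h = a * h + b * x_t
        out.append(c * h)
    return out


def bidirectional_scan(x: List[float], a: float = 0.9, b: float = 1.0,
                       c: float = 1.0) -> Tuple[List[float], List[float]]:
    """Run the same causal scan left-to-right and right-to-left.
    The backward pass scans the reversed sequence and is flipped back,
    so position t in `bwd` summarises inputs at t and later."""
    fwd = ssm_scan(x, a, b, c)
    bwd = ssm_scan(x[::-1], a, b, c)[::-1]
    return fwd, bwd


def fuse(fwd: List[float], bwd: List[float],
         w_fwd: float = 0.5, w_bwd: float = 0.5) -> List[float]:
    """Assumed fusion: a per-position weighted sum of both directions.
    The paper's actual fusion module may differ."""
    return [w_fwd * f + w_bwd * g for f, g in zip(fwd, bwd)]


if __name__ == "__main__":
    # A single impulse in the middle of the sequence.
    fwd, bwd = bidirectional_scan([0.0, 1.0, 0.0], a=0.5)
    print(fwd)             # forward pass: impulse only influences later positions
    print(bwd)             # backward pass: impulse only influences earlier positions
    print(fuse(fwd, bwd))  # fused output carries context from both sides
```

In this toy setting, a unidirectional scan leaves the position before the impulse with no information about it, while the fused output reflects the impulse on both sides, which is the limitation a bidirectional design is meant to address.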

Topics

Machine Learning > Core Methods > Representation Learning
Computer Vision > Analysis > Anomaly Detection
Computer Vision > Processing > Audio Processing

Keywords

feature extraction, state space model, audio deepfake detection, convolutional layer, bidirectional model

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024