ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform

Junyi Peng; Xiaoyang Qu; Jianzong Wang; Rongzhi Gu; Jing Xiao; Lukas Burget; Jan Černocký

2021 INTERSPEECH INTERSPEECH 2021

ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform

Abstract

Recently, extracting speaker embedding directly from raw waveform has drawn increasing attention in the field of speaker verification. Parametric real-valued filters in the first convolutional layer are learned to transform the waveform into time-frequency representations. However, these methods only focus on the magnitude spectrum and the poor interpretability of the learned filters limits the performance. In this paper, we propose a complex speaker embedding extractor, named ICSpk, with higher interpretability and fewer parameters. Specifically, at first, to quantify the speaker-related frequency response of waveform, we modify the original short-term Fourier transform filters into a family of complex exponential filters, named interpretable complex (IC) filters. Each IC filter is confined by a complex exponential filter parameterized by frequency. Then, a deep complex-valued speaker embedding extractor is designed to operate on the complex-valued output of IC filters. The proposed ICSpk is evaluated on VoxCeleb and CNCeleb databases. Experimental results demonstrate the IC filters-based system exhibits a significant improvement over the complex spectrogram based systems. Furthermore, the proposed ICSpk outperforms existing raw waveform based systems by a large margin.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Speech & Audio

🧭 Keyword Pioneer — short-term fourier transform

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Security & Privacy, Speech & Audio

Authors

Junyi Peng , Xiaoyang Qu , Jianzong Wang , Rongzhi Gu , Jing Xiao , Lukas Burget , Jan Černocký

Topics

Machine Learning > Core Methods > Embedding Learning Deep Learning > Architectures > Neural Networks Deep Learning > Techniques > Model Architecture Speech & Audio > Recognition > Speaker Recognition

Keywords

speaker embedding speaker verification short-term fourier transform complex-valued neural network raw waveform complex-valued network

Download PDF

Related papers

Energy-Friendly Keyword Spotting System Using Add-Based Convolution 2021

Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information 2021

Using Games to Augment Corpora for Language Recognition and Confusability 2021

A Psychology-Driven Computational Analysis of Political Interviews 2021

The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results 2021