Exploiting Wavelet Scattering Transform for an Unsupervised Speaker Diarization in Deep Neural Network Framework

Arunav Arya; Murtiza Ali; Karan Nathwani

INTERSPEECH 2024

Abstract

Advancements in diarization have prompted the development of supervised learning models that extract fixed-length embeddings from audio files of varying lengths. Although this is challenging, commercial API models such as Speechbrain, Resemblyzer, Whisper AI, and Pyannote have addressed the problem. However, these models typically rely on Mel-Frequency Cepstral Coefficient (MFCC) features, convolution layers, and dimension-reduction techniques to create embeddings. Our proposed method introduces a Wavelet Scattering Transform (WST) that prioritizes information content and lets users customize the shape of the embeddings according to their model requirements. Coupling WST with AutoEncoders (WST-AE) in a residual manner enhances the semantic latent-space representations, which can then be clustered segment-wise in an unsupervised manner. Tests on the AMI and VoxConverse datasets show a reduction in Diarization Error Rate (DER) with fewer training parameters and without the need for a separate embedding model.
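The segment-wise pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses only a simplified first-order scattering step (analytic band-pass filtering, modulus, then time averaging), omits the residual autoencoder entirely, and substitutes a small k-means routine for whatever clustering the paper uses. The function names `scattering_embedding` and `kmeans`, the octave-spaced Gaussian filters, and the synthetic two-tone "speakers" are all assumptions made for illustration.

```python
import numpy as np

def scattering_embedding(x, num_filters=8):
    """Simplified first-order scattering: mean of |x * psi_k| for each filter k.

    Averaging over time yields a fixed-length vector for any segment length.
    """
    n = len(x)
    freqs = np.fft.fftfreq(n)                       # cycles per sample, in [-0.5, 0.5)
    X = np.fft.fft(x)
    centers = 0.25 / 2.0 ** np.arange(num_filters)  # assumed octave spacing
    feats = []
    for fc in centers:
        # Gaussian band-pass kept on positive frequencies only (analytic filter).
        psi_hat = np.exp(-0.5 * ((freqs - fc) / (fc / 2.0)) ** 2) * (freqs > 0)
        envelope = np.abs(np.fft.ifft(X * psi_hat))  # wavelet modulus
        feats.append(envelope.mean())                # low-pass step as a global average
    return np.log1p(np.asarray(feats))

def kmeans(points, k=2, iters=20):
    """Tiny k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(d)])
    centroids = np.stack(centroids)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):                  # keep old centroid if cluster is empty
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

# Synthetic stand-in for two speakers: a 200 Hz tone and a 1000 Hz tone plus noise.
rng = np.random.default_rng(0)
sr = 16000
truth = np.array([0, 0, 1, 1, 0, 1, 1, 0])
lengths = [8000, 12000, 8000, 12000, 8000, 12000, 8000, 12000]  # varying segment lengths
segments = []
for spk, n in zip(truth, lengths):
    t = np.arange(n) / sr
    f0 = 200.0 if spk == 0 else 1000.0
    segments.append(np.sin(2 * np.pi * f0 * t) + 0.1 * rng.standard_normal(n))

embeddings = np.stack([scattering_embedding(s) for s in segments])
labels = kmeans(embeddings, k=2)
print(embeddings.shape, labels)
```

In a fuller version, the embeddings would presumably pass through the residual autoencoder before clustering, and the hand-rolled filter bank could be replaced by a dedicated scattering library; both choices are outside what the abstract specifies.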

Topics

Machine Learning > Core Methods > Clustering; Machine Learning > Learning Types > Unsupervised Learning; Deep Learning > Architectures > Autoencoders

Keywords

unsupervised clustering; speaker diarization; residual connection; wavelet scattering transform
