Attention-augmented X-vectors for the Evaluation of Mimicked Speech Using Sparse Autoencoder-LSTM framework

Bhasi K. C.; Rajeev Rajan; Noumida A

INTERSPEECH 2024

Abstract

This paper evaluates the quality of mimicked speech using speaker embeddings. We propose an attention-augmented encoded speaker embedding for evaluating mimicking speakers. X-vector embeddings extracted from spectral features are passed through a 1-D convolutional neural network (CNN) with an attention module, and the resulting output is fed into a sparse autoencoder. The encoded vector is then fed to a long short-term memory (LSTM)-based scoring mechanism. The best mimicking artist is first identified by a perception test; we then investigate whether the LSTM-based self-attention model predicts the same artist. When the model assigns the highest probability (rank-1) to the artist identified by the mean opinion score (MOS), we count one hit. Performance is evaluated on a mimicry dataset using top-X criteria. The experiments demonstrate the efficacy of the proposed vector representation for competency evaluation of voice mimicking.
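The hit-counting rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `top_x_hit_rate`, the array layout (one row of model probabilities per trial, one column per mimicking artist), and the toy numbers are all assumptions made for illustration.

```python
import numpy as np


def top_x_hit_rate(scores, mos_best, x=1):
    """Fraction of trials in which the MOS-chosen artist is in the model's top x.

    scores:   (n_trials, n_artists) model probabilities (hypothetical layout)
    mos_best: (n_trials,) index of the artist rated best in the perception test
    """
    scores = np.asarray(scores, dtype=float)
    mos_best = np.asarray(mos_best)
    # Indices of the x highest-scoring artists in each trial.
    top_x = np.argsort(-scores, axis=1)[:, :x]
    # A trial is a hit when the MOS-chosen artist appears among those indices.
    hits = (top_x == mos_best[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data (fabricated for illustration only): 4 trials, 3 artists each.
scores = np.array([
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.20, 0.70],
    [0.40, 0.35, 0.25],
])
mos_best = np.array([0, 2, 2, 1])

top1 = top_x_hit_rate(scores, mos_best, x=1)
top2 = top_x_hit_rate(scores, mos_best, x=2)
```

With x=1 this reduces to the rank-1 hit defined above; larger x relaxes the criterion so that a trial counts as a hit whenever the MOS-chosen artist is among the x highest-scoring candidates.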

Topics

Deep Learning > Architectures > Autoencoders
Deep Learning > Architectures > Neural Networks

Keywords

attention mechanism, speaker embedding, convolutional neural network, sparse autoencoder, LSTM network
