INTERSPEECH 2024

Attention-augmented X-vectors for the Evaluation of Mimicked Speech Using Sparse Autoencoder-LSTM framework

Abstract

This paper evaluates the quality of mimicked speech by computing speaker embeddings. We propose an attention-augmented encoded speaker embedding for evaluating mimicking speakers. X-vector embeddings extracted from the spectral features are passed through a 1-D convolutional neural network (CNN) with an attention module. The resulting output is fed into a sparse autoencoder, and the encoded vector is then fed to a long short-term memory (LSTM)-based scoring mechanism. The best mimicking artist is first identified by a perception test. We then investigate whether the LSTM-based self-attention model predicts the same artist. When the model assigns the highest probability (rank-1) to the artist identified by the mean opinion score (MOS), we count one hit. The performance evaluation is carried out on a mimicry dataset using top-X criteria. The experiment demonstrates the efficacy of the proposed vector representation in evaluating voice-mimicking competency.
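The hit-counting rule described in the abstract can be illustrated with a minimal sketch. It assumes each trial provides the model's probability for every candidate artist plus the index of the MOS-identified best artist. The function names (`top_x_hit`, `top_x_accuracy`) and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def top_x_hit(probs: Sequence[float], mos_best: int, x: int = 1) -> bool:
    """Return True if the MOS-identified artist is among the x most probable artists.

    Assumption: probs[i] is the model's score for artist i.
    Ties are broken by artist index, because sorted() is stable.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return mos_best in ranked[:x]


def top_x_accuracy(trials: List[Tuple[Sequence[float], int]], x: int = 1) -> float:
    """Fraction of trials in which a hit occurs under the top-x criterion."""
    hits = sum(top_x_hit(probs, mos_best, x) for probs, mos_best in trials)
    return hits / len(trials)


# Toy data (hypothetical): three trials, three candidate artists each.
trials = [
    ([0.6, 0.3, 0.1], 0),  # MOS-best artist is ranked first
    ([0.2, 0.5, 0.3], 2),  # MOS-best artist is ranked second
    ([0.1, 0.2, 0.7], 0),  # MOS-best artist is ranked third
]

for x in (1, 2, 3):
    print(f"top-{x} accuracy: {top_x_accuracy(trials, x):.3f}")
```

With x = 1 this reduces to the rank-1 hit described above; larger values of x relax the criterion so that a trial also counts when the MOS-identified artist is merely among the x most probable candidates.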
