INTERSPEECH 2020

Attention-Based Speaker Embeddings for One-Shot Voice Conversion

Abstract

This paper proposes a novel approach that embeds speaker information into feature vectors at the frame level using an attention mechanism, and applies it to one-shot voice conversion. A one-shot voice conversion system is one in which only a single utterance from the target speaker is available for conversion. In many such systems, a speaker encoder compresses the target speaker's utterance into a fixed-size vector that carries the speaker information. However, this representation loses temporal information related to speaker identity, which can degrade conversion quality. To alleviate this problem, we propose a novel way to embed speaker information using an attention mechanism. Instead of compressing the utterance into a fixed-size vector, our proposed speaker encoder outputs a sequence of speaker embedding vectors. An attention mechanism selectively combines this sequence with the input frames of the source speaker. Finally, a decoder uses the resulting time-varying speaker information to generate the converted features. In objective evaluation, our method reduced the average mel-cepstrum distortion from 5.34 dB for the baseline system to 5.23 dB. In a subjective preference test, our proposed system outperformed the baseline.
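The frame-level combination described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the attention formulation, so scaled dot-product attention is assumed here, with source frames acting as queries and the speaker embedding sequence acting as both keys and values. The function names `softmax` and `attend`, the dimensions, and the random inputs are all hypothetical and stand in for real encoder outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(source_frames, speaker_embeddings):
    """Return one speaker context vector per source frame.

    Assumption: scaled dot-product attention; the paper's actual
    scoring function may differ.
    source_frames:      list of T_src vectors (queries)
    speaker_embeddings: list of T_tgt vectors (keys and values)
    """
    dim = len(speaker_embeddings[0])
    scale = math.sqrt(dim)
    contexts, all_weights = [], []
    for query in source_frames:
        # Similarity of this source frame to every speaker embedding.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in speaker_embeddings
        ]
        weights = softmax(scores)
        # Weighted sum of the speaker embeddings: time-varying speaker info.
        context = [
            sum(w * emb[j] for w, emb in zip(weights, speaker_embeddings))
            for j in range(dim)
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


# Hypothetical stand-ins for encoder outputs (6 source frames, 5 embeddings).
random.seed(0)
DIM = 4
source_frames = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]
speaker_embeddings = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
contexts, weights = attend(source_frames, speaker_embeddings)
```

Each source frame receives its own context vector, in contrast to a fixed-size speaker vector that would be identical for every frame. In a full system, these context vectors would be passed to the decoder together with the source content features.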
