Attention-Based Speaker Embeddings for One-Shot Voice Conversion

Tatsuma Ishihara; Daisuke Saito

INTERSPEECH 2020

Abstract

This paper proposes a novel approach to embedding speaker information into feature vectors at the frame level using an attention mechanism, and applies it to one-shot voice conversion. A one-shot voice conversion system is a voice conversion system in which only one utterance from the target speaker is available for conversion. In many one-shot voice conversion systems, a speaker encoder compresses an utterance of the target speaker into a fixed-size vector to propagate speaker information. However, the resulting representation loses temporal information related to speaker identity, which could degrade conversion quality. To alleviate this problem, we propose a novel way to embed speaker information using an attention mechanism. Instead of compressing the utterance into a fixed-size vector, our proposed speaker encoder outputs a sequence of speaker embedding vectors. The attention mechanism selectively combines this sequence with the input frames of the source speaker. Finally, the decoder uses the resulting time-varying speaker information to generate the converted features. In the objective evaluation, our method reduced the average mel-cepstrum distortion from 5.34 dB for the baseline system to 5.23 dB. In the subjective preference test, our proposed system outperformed the baseline.
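The frame-level combination described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the form of the attention, so this example assumes plain scaled dot-product attention in which each source frame acts as a query over the sequence of target-speaker embeddings. The function names `softmax` and `attend`, the toy dimensions, and the absence of learned projections are all assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(source_frames, speaker_embeddings):
    """Combine a speaker-embedding sequence with source frames.

    source_frames:      list of T_src vectors (queries), each of length d
    speaker_embeddings: list of T_tgt vectors (keys and values), each of length d
    Returns one speaker vector per source frame, plus the attention weights.
    Assumption: scaled dot-product attention without learned projections.
    """
    dim = len(speaker_embeddings[0])
    scale = math.sqrt(dim)
    contexts, weights = [], []
    for query in source_frames:
        # Similarity between this source frame and every speaker embedding.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in speaker_embeddings]
        w = softmax(scores)
        # Weighted sum of speaker embeddings: time-varying speaker information.
        context = [sum(wj * emb[i] for wj, emb in zip(w, speaker_embeddings))
                   for i in range(dim)]
        contexts.append(context)
        weights.append(w)
    return contexts, weights

if __name__ == "__main__":
    # Toy data: 3 source frames and 2 speaker embeddings, dimension 2.
    src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    spk = [[1.0, 0.0], [0.0, 1.0]]
    ctx, w = attend(src, spk)
    print(len(ctx), len(ctx[0]))
```

Unlike a fixed-size speaker vector, the output here has one vector per source frame, so the speaker information passed to a decoder can vary over time.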

Topics

Machine Learning > Core Methods > Embedding Learning
Deep Learning > Architectures > Transformers
Deep Learning > Architectures > Neural Networks
Speech & Audio > Synthesis > Speech Enhancement
Speech & Audio > Analysis > Speaker Verification
Machine Learning > Learning Paradigms > Few-Shot Learning

Keywords

attention mechanism; temporal information; speaker embedding; one-shot voice conversion; speaker encoder; frame-level vector
