Two-Pathway Style Embedding for Arbitrary Voice Conversion

Xuexin Xu; Liang Shi; Jinhui Chen; Xunquan Chen; Jie Lian; Pingyuan Lin; Zhihong Zhang; Edwin R. Hancock

INTERSPEECH 2021

Abstract

Arbitrary voice conversion, also referred to as zero-shot voice conversion, has recently attracted increased attention in the literature. Disentangling the linguistic and style representations of acoustic features is an effective way to achieve zero-shot voice conversion, but converting to a natural speaker style remains challenging because of the intrinsic variability of speech and the difficulty of completely decoupling the two representations. We therefore propose a Two-Pathway Style Embedding Voice Conversion framework (TPSE-VC) for realistic and natural speech conversion. The novel feature of this method is that it embeds sentence-level and phoneme-level style information simultaneously. A novel attention mechanism implicitly aligns timbre style with phoneme content and, from this alignment, embeds a phoneme-level style representation. In addition, the complete set of time steps of the audio style is embedded into a fixed-length vector to obtain the sentence-level style representation. Moreover, TPSE-VC does not require any pre-trained models and is trained only on non-parallel speech data. Experimental results demonstrate that the proposed TPSE-VC outperforms state-of-the-art methods on zero-shot voice conversion.
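The two pathways described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the attention mechanism or the pooling operation, so the sketch assumes plain scaled dot-product attention for the phoneme-level pathway and mean pooling over time for the sentence-level pathway. The function names, the toy feature values, and the assumption that content and style frames share one feature dimension are all hypothetical.

```python
import math


def sentence_level_style(style_frames):
    """Collapse every style time step into one fixed-length vector.

    Assumption: mean pooling over time; the paper's actual operation may differ.
    """
    n = len(style_frames)
    dim = len(style_frames[0])
    return [sum(frame[d] for frame in style_frames) / n for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def phoneme_level_style(content_frames, style_frames):
    """Softly align each content frame with the style frames.

    Assumption: scaled dot-product attention, with content frames as queries
    and style frames as both keys and values.
    """
    dim = len(style_frames[0])
    scale = math.sqrt(dim)
    aligned = []
    for query in content_frames:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in style_frames]
        weights = softmax(scores)
        aligned.append([sum(w * frame[d] for w, frame in zip(weights, style_frames))
                        for d in range(dim)])
    return aligned


# Toy inputs (hypothetical values): 3 content frames, 4 style frames, 2 features each.
content = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
style = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]]

sentence_vec = sentence_level_style(style)           # one vector for the utterance
phoneme_vecs = phoneme_level_style(content, style)   # one vector per content frame
```

The sentence-level output has a fixed length regardless of how many style frames are supplied, while the phoneme-level output has one style vector per content frame, which is what allows the style and content sequences to differ in length.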

Topics

Machine Learning > Learning Types > Zero-Shot Learning; Deep Learning > Architectures > Neural Networks; Speech & Audio > Synthesis > Speech Enhancement; Deep Learning > Learning Types > Deep Learning

Keywords

zero-shot learning; attention mechanism; voice conversion; speaker embedding; zero-shot voice conversion; style embedding; speaker style
