Sequence-to-Sequence Voice Conversion with Similarity Metric Learned Using Generative Adversarial Networks

Takuhiro Kaneko; Hirokazu Kameoka; Kaoru Hiramatsu; Kunio Kashino

INTERSPEECH 2017

Abstract

We propose a training framework for sequence-to-sequence voice conversion (SVC). A well-known problem with conventional VC frameworks is that the acoustic-feature sequences generated by a converter tend to be over-smoothed, resulting in buzzy-sounding speech. This is because parameter training of the acoustic model assumes a particular form of similarity metric or distribution, so that a generated feature sequence that fits the training target examples on average is considered optimal. This over-smoothing occurs as long as a manually constructed similarity metric is used. To overcome this limitation, our proposed SVC framework uses a similarity metric implicitly derived from a generative adversarial network, which allows distance to be measured in a high-level abstract space. This enables the model to mitigate the over-smoothing problem that arises in the low-level data space. Furthermore, we use convolutional neural networks to model long-range context dependencies. This also gives the similarity metric a shift-invariant property, making the model robust against misalignment errors in the parallel data. We tested our framework on a non-native-to-native VC task. The experimental results showed that the proposed framework was effective to some extent in improving naturalness, clarity, and speaker individuality.
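The difference between a data-space metric and a feature-space metric under misalignment can be illustrated with a toy example. The sketch below is not the authors' model: the convolution kernels are random rather than learned adversarially, the convolution is circular, and global max pooling stands in for whatever pooling the real network uses. All function names are hypothetical. It only shows why a frame-wise squared error penalizes a time-shifted copy of the target while a convolutional feature distance need not.

```python
import numpy as np


def conv_features(seq, kernels):
    """Circular 1-D correlation, ReLU, then global max pooling over time.

    Assumption: a stand-in for a hidden layer of a discriminator; the
    kernels here are fixed and random, not learned.
    """
    n_frames = len(seq)
    feats = []
    for k in kernels:
        # Response of kernel k at every (circular) time position.
        resp = np.array(
            [np.dot(np.roll(seq, -t)[: len(k)], k) for t in range(n_frames)]
        )
        # Max over time discards where the pattern occurred.
        feats.append(np.maximum(resp, 0.0).max())
    return np.array(feats)


def data_space_distance(a, b):
    """Frame-wise mean squared error in the raw feature space."""
    return float(np.mean((a - b) ** 2))


def feature_space_distance(a, b, kernels):
    """Mean squared error between pooled convolutional features."""
    fa = conv_features(a, kernels)
    fb = conv_features(b, kernels)
    return float(np.mean((fa - fb) ** 2))


rng = np.random.default_rng(0)
kernels = rng.standard_normal((4, 5))  # 4 hypothetical filters of width 5

# A toy one-dimensional "acoustic feature" trajectory and a misaligned copy.
target = np.sin(np.linspace(0.0, 4.0 * np.pi, 64, endpoint=False))
shifted = np.roll(target, 3)  # same content, shifted by 3 frames

d_data = data_space_distance(target, shifted)
d_feat = feature_space_distance(target, shifted, kernels)
print(f"data-space distance:    {d_data:.4f}")
print(f"feature-space distance: {d_feat:.4f}")
```

In this toy setting the data-space distance is clearly non-zero even though the two sequences carry the same content, whereas the pooled feature distance vanishes. A learned metric will not be exactly shift-invariant in practice; the example only isolates the mechanism.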

Topics

Machine Learning > Core Methods > Metric Learning
Machine Learning > Learning Types > Adversarial Learning
Deep Learning > Models > Generative Models
Speech & Audio > Synthesis > Speech Enhancement
Deep Learning > Learning Types > Adversarial Learning

Keywords

voice conversion; convolutional neural network; sequence-to-sequence learning; generative adversarial network; similarity metric
