INTERSPEECH 2019

Duration Modeling with Global Phoneme-Duration Vectors

Abstract

A duration model is a major component of every parametric speech synthesis system. Conventional methods predict phoneme durations from full contextual labels, which require morphological analysis of the text. By contrast, advances in bidirectional recurrent neural networks (BRNNs) and global space vector models make it possible to perform grapheme-to-phoneme (G2P) conversion from plain text. In this paper, we investigate duration prediction from plain phonemes rather than from their full contextual labels. We propose a new approach that relies on both a BRNN and global space vector representations of phonemes (GPVs) and durations (GDVs). GPVs represent the statistics of phonemes used in a language, whereas GDVs capture duration variations beyond linguistic features. They are essentially learned in an unsupervised manner from a large-scale text corpus in which phonemes are obtained by G2P conversion. We conducted experiments on two speech corpora, in Korean and Chinese, training BRNN-based models in a supervised manner. An objective evaluation on a set of test sentences showed that the proposed method models phoneme durations more accurately than the baselines.
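
As a rough illustration of what "statistics of phonemes used in a language" could mean in practice, the sketch below counts windowed phoneme co-occurrences over G2P-converted sequences. It assumes that the global vectors are fit to GloVe-style co-occurrence counts, which the abstract does not state; the function name `cooccurrence_counts`, the window size, the 1/distance weighting, and the toy sequences are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def cooccurrence_counts(sequences, window=2):
    """Count symmetric, distance-weighted phoneme co-occurrences.

    Assumption: GloVe-style statistics, where a pair that is `offset`
    positions apart contributes 1/offset to the count.
    """
    counts = Counter()
    for seq in sequences:
        for i, center in enumerate(seq):
            for offset in range(1, window + 1):
                j = i + offset
                if j >= len(seq):
                    break
                weight = 1.0 / offset
                # Record the pair in both directions (symmetric context).
                counts[(center, seq[j])] += weight
                counts[(seq[j], center)] += weight
    return counts


# Hypothetical toy phoneme sequences, standing in for G2P output.
toy_sequences = [["a", "n", "a"], ["n", "a"]]
counts = cooccurrence_counts(toy_sequences, window=2)
```

A vector model would then be trained so that phoneme vectors reproduce these counts; that training step, and how duration vectors are derived, are left out here because the abstract gives no details.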
