Phrase Break Prediction with Bidirectional Encoder Representations in Japanese Text-to-Speech Synthesis

Kosuke Futamata; Byeongseon Park; Ryuichi Yamamoto; Kentaro Tachibana

INTERSPEECH 2021

Abstract

We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model, namely BERT, with explicit features extracted from a BiLSTM with linguistic features. In conventional BiLSTM-based methods, word representations and/or sentence representations are used as independent components. The proposed method takes both representations into account to extract latent semantics that previous methods cannot capture. Objective evaluation results show that the proposed method obtains an absolute improvement of 3.2 points in F1 score over conventional BiLSTM-based methods using linguistic features. Moreover, perceptual listening test results verify that a TTS system using our proposed method achieves a mean opinion score of 4.39 in prosody naturalness, which is highly competitive with the score of 4.37 for speech synthesized with ground-truth phrase breaks.
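To make the objective metric concrete, the following is a minimal sketch of how an F1 score could be computed for phrase break prediction, assuming the task is framed as per-token binary labeling (1 = a break follows the token, 0 = no break) and that F1 is taken over the break class. The abstract does not state the exact scoring protocol, so this framing, the function name `phrase_break_f1`, and the toy label sequences are all illustrative assumptions rather than the authors' evaluation code.

```python
def phrase_break_f1(reference, predicted):
    """F1 over the positive (break) class for per-token binary labels.

    Assumption: 1 marks a phrase break after the token, 0 marks no break.
    """
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # correctly predicted breaks
    fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # spurious breaks
    fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # missed breaks

    if tp == 0:
        # No correct break predictions: precision and recall are zero or undefined.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical labels, not data from the paper).
reference = [0, 1, 0, 0, 1, 0, 1, 0]
predicted = [0, 1, 0, 1, 1, 0, 0, 0]
score = phrase_break_f1(reference, predicted)
print(f"F1 = {score:.3f}")
```

On this 0 to 1 scale, an absolute improvement of 3.2 points corresponds to a difference of 0.032 between two systems' scores.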

Topics

Deep Learning > Techniques > Pretraining
Speech & Audio > Synthesis > Text-to-Speech

Keywords

pretrained language model; text-to-speech synthesis; bidirectional encoder; phrase break prediction; prosody naturalness
