Language Model-Based Emotion Prediction Methods for Emotional Speech Synthesis Systems

Hyun-Wook Yoon; Ohsung Kwon; Hoyeon Lee; Ryuichi Yamamoto; Eunwoo Song; Jae-Min Kim; Min-Jae Hwang

INTERSPEECH 2022

Abstract

This paper proposes an effective emotional text-to-speech (TTS) system with a pre-trained language model (LM)-based emotion prediction method. Unlike conventional systems that require auxiliary inputs such as manually defined emotion classes, our system directly estimates emotion-related attributes from the input text. Specifically, we utilize generative pre-trained transformer (GPT)-3 to jointly predict an emotion class and its strength, which represent the coarse and fine properties of the emotion, respectively. These attributes are then combined in the emotional embedding space and used as conditional features of the TTS model for generating output speech signals. Consequently, the proposed system can produce emotional speech from text alone, without any auxiliary inputs. Furthermore, because GPT-3 can capture the emotional context across consecutive sentences, the proposed method can effectively handle paragraph-level generation of emotional speech.
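The data flow described in the abstract (text, then a predicted emotion class and strength, then a single conditioning vector for the TTS model) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the emotion labels, the three-dimensional embedding table, the keyword rule standing in for the GPT-3 predictor, and the choice to combine class and strength by linear interpolation from a neutral embedding are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical emotion embedding table (values chosen only for illustration).
EMBEDDINGS: Dict[str, List[float]] = {
    "neutral": [0.0, 0.0, 0.0],
    "happy": [1.0, 0.5, 0.0],
    "sad": [-1.0, 0.0, 0.5],
}

# Hypothetical keyword rule standing in for the LM-based predictor.
KEYWORDS: Dict[str, Tuple[str, float]] = {
    "wonderful": ("happy", 0.8),
    "terrible": ("sad", 0.7),
}


@dataclass
class EmotionPrediction:
    label: str       # coarse property: emotion class
    strength: float  # fine property: intensity in [0, 1]


def predict_emotion(text: str) -> EmotionPrediction:
    """Toy stand-in for the language model: first matching keyword wins."""
    lowered = text.lower()
    for keyword, (label, strength) in KEYWORDS.items():
        if keyword in lowered:
            return EmotionPrediction(label, strength)
    return EmotionPrediction("neutral", 0.0)


def emotion_condition(pred: EmotionPrediction) -> List[float]:
    """Combine class and strength into one conditioning vector.

    Assumption: interpolate from the neutral embedding toward the class
    embedding, with the strength clamped to [0, 1].
    """
    neutral = EMBEDDINGS["neutral"]
    target = EMBEDDINGS[pred.label]
    s = min(max(pred.strength, 0.0), 1.0)
    return [n + s * (t - n) for n, t in zip(neutral, target)]


if __name__ == "__main__":
    prediction = predict_emotion("What a wonderful day")
    print(prediction.label, emotion_condition(prediction))
```

In a real system the conditioning vector would be passed to the TTS acoustic model together with the text; that step is omitted here because the abstract does not specify the model interface.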

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Speech & Audio > Synthesis > Text-to-Speech
Speech & Audio > Analysis > Prosody Analysis

Keywords

prosody analysis, language model, generative pre-trained transformer, emotion prediction, emotional speech synthesis
