A Multi-Speaker Emotion Morphing Model Using Highway Networks and Maximum Likelihood Objective

Ravi Shankar; Jacob Sager; Archana Venkataraman

INTERSPEECH 2019

Abstract

We introduce a new model for emotion conversion in speech based on highway neural networks. Our model uses the contextual pitch, energy and spectral information of a source emotional utterance to predict the framewise fundamental frequency and signal intensity under a target emotion. We also incorporate a latent gender representation to promote cross-speaker generalizability. Our neural network is trained to maximize the log-likelihood of the prediction error under an assumed Laplacian distribution. We validate our model on the VESUS repository collected at Johns Hopkins University, which contains parallel emotional utterances from 10 actors across 5 emotional classes. The proposed algorithm outperforms three state-of-the-art baselines in terms of the mean absolute error and correlation between the predicted and target values. We evaluate the quality of our emotion manipulations via crowd-sourcing. Finally, we apply our emotion morphing model to utterances generated by WaveNet to demonstrate our unique ability to inject emotion into synthetic speech.
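The two technical ingredients named in the abstract, a highway layer and a Laplacian maximum likelihood objective, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the tanh nonlinearity, the fixed `scale` parameter, and the layer shapes are all assumptions made for illustration. With a fixed scale, the Laplacian negative log-likelihood reduces to a constant plus the scaled mean absolute error, which is consistent with the abstract's use of mean absolute error as an evaluation metric.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here for the highway transform gate."""
    return 1.0 / (1.0 + np.exp(-x))


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x.

    The tanh candidate transform is an assumption; the abstract does
    not state which nonlinearity the model uses.
    """
    h = np.tanh(x @ w_h + b_h)   # candidate transform H(x)
    t = sigmoid(x @ w_t + b_t)   # transform gate T(x), values in (0, 1)
    return t * h + (1.0 - t) * x  # gated mix of transform and input


def laplacian_nll(target, prediction, scale=1.0):
    """Mean negative log-likelihood of errors under Laplace(0, scale).

    Minimizing this quantity maximizes the error log-likelihood.
    `scale` is a hypothetical fixed hyperparameter in this sketch.
    """
    abs_err = np.abs(target - prediction)
    return float(np.mean(np.log(2.0 * scale) + abs_err / scale))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))              # 4 frames, 8 features
    w_h = rng.normal(scale=0.1, size=(8, 8))
    w_t = rng.normal(scale=0.1, size=(8, 8))
    y = highway_layer(x, w_h, np.zeros(8), w_t, np.zeros(8))
    print(y.shape)
    print(laplacian_nll(x, y))
```

A strongly negative gate bias drives the gate toward zero, so the layer passes its input through almost unchanged; this carry behaviour is what distinguishes a highway layer from a plain feed-forward layer.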

Topics

Machine Learning > Core Methods > Regression

Deep Learning > Architectures > Neural Networks

Speech & Audio > Synthesis > Speech Enhancement

Machine Learning > Learning Types > Supervised Learning

Keywords

speech synthesis; fundamental frequency; maximum likelihood; maximum likelihood estimation; latent representation; highway network; emotion morphing; speech emotion conversion

Related papers

Using Real-Time Visual Biofeedback for Second Language Instruction 2019

VAE-Based Regularization for Deep Speaker Embedding 2019

End-to-End SpeakerBeam for Single Channel Target Speech Recognition 2019

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition 2019

Attentive to Individual: A Multimodal Emotion Recognition Network with Personalized Attention Profile 2019