2022 INTERSPEECH INTERSPEECH 2022

Cross-Cultural Comparison of Gradient Emotion Perception: Human vs. Alexa TTS Voices

Abstract

This study compares how American (US) and German (DE) listeners perceive emotional expressiveness from Amazon Alexa text-to-speech (TTS) and human voices. Participants heard identical stimuli, manipulated from an emotionally ‘neutral' production to three levels of increased happiness generated by resynthesis. Results show that, for both groups, ‘happiness' manipulations lead to higher ratings of emotional valence (i.e., more positive) for the human voice. Moreover, there was a difference across the groups in their perception of arousal (i.e., excitement): US listeners show higher ratings for human voices with manipulations, while DE listeners perceive the Alexa voice as sounding less ‘excited' overall. We discuss these findings in terms of theories of cross-cultural emotion perception and human-computer interaction.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Interdisciplinary
🧭 Keyword Pioneer — cross-cultural comparison
🐝 Cross-Pollinator — Artificial Intelligence, Computer Vision, Deep Learning, Interdisciplinary, Machine Learning, Natural Language Processing, Speech & Audio