CAUSE: Crossmodal Action Unit Sequence Estimation from Speech

Hirokazu Kameoka; Takuhiro Kaneko; Shogo Seki; Kou Tanaka

INTERSPEECH 2022

Abstract

This paper proposes a task and a method for estimating a sequence of facial action units (AUs) solely from speech. AUs were introduced in the Facial Action Coding System to objectively describe facial muscle activations. Our motivation is that AUs can be useful continuous quantities for representing a speaker's subtle emotional states, attitudes, and moods in a variety of applications such as expressive speech synthesis and emotional voice conversion. We hypothesize that information about the speaker's facial muscle movements is expressed in the generated speech and can be predicted from speech alone. To verify this, we devise a neural network model that predicts an AU sequence from the mel-spectrogram of input speech and train it using a large-scale audio-visual dataset consisting of many speaking face-tracks. We call our method and model "crossmodal AU sequence estimation/estimator (CAUSE)". We implemented several of the most basic architectures for CAUSE and quantitatively confirmed that the fully convolutional architecture performed best. Furthermore, by combining CAUSE with an AU-conditioned image-to-image translation method, we implemented a system that animates a given still face image from speech. Using this system, we confirmed the potential usefulness of AUs as a representation of non-linguistic features via subjective evaluations.
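To make the input/output relationship concrete, the following is a minimal sketch of a fully convolutional mapping from a mel-spectrogram to a frame-aligned AU sequence. It is not the authors' architecture: the layer count, kernel size, 80 mel bins, 17 AUs, sigmoid output range, and random untrained weights are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import random


def conv1d_same(x, weight, bias):
    """1-D convolution over time, zero-padded so the output keeps the input length.

    x:      list of T frames, each a list of C_in floats
    weight: weight[k][i][o] for kernel tap k (odd kernel size), input i, output o
    bias:   list of C_out floats
    """
    num_frames, kernel = len(x), len(weight)
    c_in, c_out = len(weight[0]), len(bias)
    half = kernel // 2
    out = []
    for t in range(num_frames):
        frame = list(bias)
        for k in range(kernel):
            s = t + k - half
            if 0 <= s < num_frames:  # frames outside the utterance count as zeros
                for i in range(c_in):
                    xi, w = x[s][i], weight[k][i]
                    for o in range(c_out):
                        frame[o] += xi * w[o]
        out.append(frame)
    return out


def init_weight(rng, kernel, c_in, c_out):
    """Random (untrained) weights; a real model would learn these from data."""
    scale = 1.0 / math.sqrt(kernel * c_in)
    return [[[rng.uniform(-scale, scale) for _ in range(c_out)]
             for _ in range(c_in)] for _ in range(kernel)]


def estimate_au_sequence(mel, layers):
    """Map T mel frames to T AU vectors using only convolutions over time."""
    h = mel
    for idx, (weight, bias) in enumerate(layers):
        h = conv1d_same(h, weight, bias)
        if idx == len(layers) - 1:
            # Assumption: AU intensities are squashed to (0, 1) by a sigmoid.
            h = [[1.0 / (1.0 + math.exp(-v)) for v in frame] for frame in h]
        else:
            h = [[max(0.0, v) for v in frame] for frame in h]  # ReLU
    return h


rng = random.Random(0)
N_MELS, HIDDEN, N_AUS = 80, 16, 17  # all three sizes are illustrative assumptions
layers = [
    (init_weight(rng, 5, N_MELS, HIDDEN), [0.0] * HIDDEN),
    (init_weight(rng, 5, HIDDEN, N_AUS), [0.0] * N_AUS),
]
mel = [[rng.gauss(0.0, 1.0) for _ in range(N_MELS)] for _ in range(50)]  # stand-in for a real mel-spectrogram
aus = estimate_au_sequence(mel, layers)
print(len(aus), len(aus[0]))
```

Because every layer is a convolution over time, the same weights apply to utterances of any length and the output has one AU vector per input frame.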

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Speech & Audio > Analysis > Clinical Speech Analysis

Keywords

facial expression; crossmodal learning; action unit; fully convolutional
