Turn-Taking Estimation Model Based on Joint Embedding of Lexical and Prosodic Contents

Chaoran Liu; Carlos Ishi; Hiroshi Ishiguro

INTERSPEECH 2017

Abstract

A natural conversation involves rapid exchanges of turns. Taking turns at appropriate times is a requisite for a dialog system that acts as a conversation partner. This paper proposes a model that estimates the timing of turn-taking during verbal interactions. Unlike previous studies, the proposed model does not rely on a silent region between sentences, since a dialog system must respond without large gaps or overlaps. We propose a Recurrent Neural Network (RNN) based model that takes the joint embedding of lexical and prosodic contents as its input, classifies utterances into turn-taking-related classes, and estimates the turn-taking timing. To this end, we trained a neural network to embed the lexical contents, the fundamental frequencies, and the speech power into a joint embedding space. To learn meaningful embedding spaces, the prosodic embedding of each utterance is pre-trained with an RNN and then combined with the utterance's lexical embedding to form the input of the proposed model. We tested the model on a spontaneous conversation dataset and confirmed that it outperformed a model using word-embedding-based features.
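
The data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the layer sizes, the plain tanh RNN cell, the averaging of word vectors, the two output classes, and all function and variable names are assumptions made for illustration, and the weights are random rather than pre-trained or trained. Each utterance's prosodic frames (fundamental frequency and power) are summarized by one RNN, the result is concatenated with a lexical embedding, and a second RNN reads the sequence of joint embeddings before a softmax layer scores the classes.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Untrained placeholder weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_last_state(sequence, w_in, w_rec):
    """Run a plain tanh RNN over `sequence` and return its final hidden state."""
    hidden = [0.0] * len(w_in)
    for x in sequence:
        hidden = [
            math.tanh(
                sum(w * v for w, v in zip(w_in[j], x))
                + sum(w * h for w, h in zip(w_rec[j], hidden))
            )
            for j in range(len(w_in))
        ]
    return hidden


def joint_embedding(word_vectors, prosody_frames, prosody_rnn):
    """Concatenate a lexical embedding with an RNN summary of the prosody."""
    # Assumption: the lexical embedding is the mean of the word vectors.
    lexical = [sum(col) / len(word_vectors) for col in zip(*word_vectors)]
    prosodic = rnn_last_state(prosody_frames, *prosody_rnn)
    return lexical + prosodic


def softmax(scores):
    """Convert raw scores into probabilities that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# All sizes below are hypothetical, including the number of classes.
WORD_DIM, PROSODY_DIM, TURN_DIM, N_CLASSES = 4, 3, 5, 2
JOINT_DIM = WORD_DIM + PROSODY_DIM

prosody_rnn = (random_matrix(PROSODY_DIM, 2, rng),
               random_matrix(PROSODY_DIM, PROSODY_DIM, rng))
turn_rnn = (random_matrix(TURN_DIM, JOINT_DIM, rng),
            random_matrix(TURN_DIM, TURN_DIM, rng))
w_out = random_matrix(N_CLASSES, TURN_DIM, rng)

# Two toy utterances: (word vectors, per-frame [normalized F0, normalized power]).
utterances = [
    (random_matrix(3, WORD_DIM, rng), [[0.4, 0.7], [0.5, 0.6], [0.3, 0.2]]),
    (random_matrix(5, WORD_DIM, rng), [[0.6, 0.8], [0.2, 0.1]]),
]

joint_sequence = [joint_embedding(w, p, prosody_rnn) for w, p in utterances]
state = rnn_last_state(joint_sequence, *turn_rnn)
probs = softmax([sum(w * h for w, h in zip(row, state)) for row in w_out])
print(probs)  # class probabilities from untrained weights; values are not meaningful
```

Keeping the prosody encoder separate from the utterance-level RNN mirrors the abstract's description of pre-training the prosodic part before combining it with the lexical embedding, although no training step is shown here.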

Authors

Chaoran Liu, Carlos Ishi, Hiroshi Ishiguro

Topics

Machine Learning > Core Methods > Classification
Machine Learning > Core Methods > Embedding Learning
Deep Learning > Architectures > Neural Networks
Speech & Audio > Analysis > Prosody Analysis
Natural Language Processing > Applications > Dialogue Systems

Keywords

fundamental frequency, recurrent neural network, dialogue system, joint embedding, prosodic feature, turn-taking estimation, lexical content

Related papers

Description of the Munich-Passau Snore Sound Corpus (MPSSC) 2017

A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification 2017

Binaural Reverberant Speech Separation Based on Deep Neural Networks 2017

Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech 2017

A Comparison of Danish Listeners’ Processing Cost in Judging the Truth Value of Norwegian, Swedish, and English Sentences 2017