INTERSPEECH 2023

Improving the response timing estimation for spoken dialogue systems by reducing the effect of speech recognition delay

Abstract

In conversational systems, the proper timing of the system's response is critical to maintaining a comfortable conversation. Accurate response timing estimation requires knowing what the user has said, including their most recent words, but ASR delay usually prevents the system from using the full user utterance. In this paper, we employed an extremely low-latency ASR model, Multi-Look-Ahead ASR by Zhao et al., so that nearly the full utterance is available for response timing estimation. We also examined the effectiveness of combining low-latency ASR with a parameter called Estimates of Syntactic Completeness (ESC), which indicates how soon the user's utterance will be completed. We evaluated the approach on a Japanese simulated dialogue database of a restaurant information center. The results confirmed that reducing ASR delay improves the accuracy of response timing estimation, and this improvement also appeared when the ESC-based method was combined with low-latency ASR.
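The central argument above is that ASR delay hides the user's most recent words from the response timing estimator. The Python sketch below illustrates that effect under a simplified, hypothetical model in which every word's recognition result becomes available a fixed `asr_delay` seconds after the word ends. The word timings, the delay values, the names `visible_words` and `remaining_words`, and the reading of ESC as "number of words still to come" are all assumptions for illustration, not the paper's implementation of Multi-Look-Ahead ASR or ESC.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    end_time: float  # time (s) at which the user finished speaking this word


def visible_words(words: List[Word], decision_time: float, asr_delay: float) -> List[str]:
    """Words whose ASR results are available at decision_time.

    Simplified, hypothetical model: each word's result is emitted exactly
    asr_delay seconds after the word ends.
    """
    return [w.text for w in words if w.end_time + asr_delay <= decision_time]


def remaining_words(words: List[Word], visible: List[str]) -> int:
    """Hypothetical ESC-style quantity: words still to come after the visible prefix."""
    return len(words) - len(visible)


# Toy utterance; the user stops speaking at 1.2 s (timings are made up).
utterance = [
    Word("I", 0.2), Word("want", 0.45), Word("a", 0.55),
    Word("table", 0.9), Word("please", 1.2),
]
end_of_speech = utterance[-1].end_time

for delay in (0.5, 0.05):  # conventional vs. low-latency ASR (illustrative values)
    seen = visible_words(utterance, end_of_speech, delay)
    print(delay, seen, remaining_words(utterance, seen))
```

At the moment the user stops speaking, the 0.5 s delay leaves two words unseen, while the 0.05 s delay leaves only one. An estimator that sees more of the utterance has less left to predict, which is one way to read the motivation for pairing low-latency ASR with ESC.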
