Towards End-to-End Spoken Dialogue Systems with Turn Embeddings
Abstract
Training task-oriented dialogue systems requires a significant amount of manual effort and the integration of many independently built components; moreover, the resulting pipeline is prone to error propagation. End-to-end training has been proposed to overcome these problems by training the whole system on the utterances of both dialogue parties. In this paper, we present an end-to-end spoken dialogue system architecture based on turn embeddings. Turn embeddings encode a robust representation of user turns together with a local dialogue history and are trained using sequence-to-sequence models: the models learn to generate the previous and the next turns of the dialogue and, additionally, to perform spoken language understanding. The end-to-end spoken dialogue system is then trained using the pre-trained turn embeddings in a stateful architecture that considers the whole dialogue history. We observe that the proposed architecture outperforms models based on local-only dialogue history and is robust to automatic speech recognition errors.
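
As an illustration of the turn-embedding training objective described above, the following minimal sketch shows how a dialogue could be split into training examples in which each turn, together with a short local history, is paired with the previous and next turns as generation targets. The names `TurnExample` and `build_turn_examples`, the `history` parameter, and the sample dialogue are assumptions made for this sketch and are not taken from the paper; the sequence-to-sequence models and the spoken language understanding labels are omitted.

```python
from typing import List, NamedTuple, Optional


class TurnExample(NamedTuple):
    """One hypothetical training example for a turn-embedding model."""
    context: List[str]        # local dialogue history preceding the turn
    current: str              # the turn to be embedded
    prev_turn: Optional[str]  # generation target: the preceding turn
    next_turn: Optional[str]  # generation target: the following turn


def build_turn_examples(turns: List[str], history: int = 1) -> List[TurnExample]:
    """Pair every turn with its local history and its neighbouring turns.

    `history` is an assumed hyperparameter: the number of preceding
    turns kept as local context.
    """
    examples = []
    for i, turn in enumerate(turns):
        start = max(0, i - history)
        examples.append(
            TurnExample(
                context=turns[start:i],
                current=turn,
                # The first turn has no predecessor and the last has no successor.
                prev_turn=turns[i - 1] if i > 0 else None,
                next_turn=turns[i + 1] if i + 1 < len(turns) else None,
            )
        )
    return examples


if __name__ == "__main__":
    dialogue = [
        "hi",
        "hello, how can I help?",
        "book a table",
        "for how many?",
    ]
    for example in build_turn_examples(dialogue, history=2):
        print(example)
```

In this sketch every turn yields an example; a system that embeds only user turns, as the abstract describes, would filter the list by speaker before training.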