PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

Kentaro Mitsui; Koh Mitsuda; Toshiaki Wakatsuki; Yukiya Hono; Kei Sawada

2024 EMNLP EMNLP 2024

PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems

Abstract

AbstractMultimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech. Our experiments on spoken question answering tasks demonstrate that our approach improves latency while maintaining the quality of response content. Additionally, we show that latency can be further reduced by generating speech in multiple sequences. Demo samples are available at https://rinnakk.github.io/research/publications/PSLM.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing and Speech & Audio

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Kentaro Mitsui , Koh Mitsuda , Toshiaki Wakatsuki , Yukiya Hono , Kei Sawada

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Generation > Dialogue Systems Artificial Intelligence > Core AI > Large Language Models Artificial Intelligence > Core AI > Multi-Modal Learning Speech & Audio > Synthesis > Speech Synthesis

Keywords

speech synthesis multimodal learning spoken dialogue system speech generation low-latency inference parallel generation large language model

Download PDF

Related papers

EmbodiedBERT: Cognitively Informed Metaphor Detection Incorporating Sensorimotor Information 2024

Mitigating Matthew Effect: Multi-Hypergraph Boosted Multi-Interest Self-Supervised Learning for Conversational Recommendation 2024

Learning to Extract Structured Entities Using Language Models 2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis 2024

CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages 2024