SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

Dong Zhang; Shimin Li; Xin Zhang; Jun Zhan; Pengyu Wang; Yaqian Zhou; Xipeng Qiu

EMNLP 2023

Abstract

Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, which prevents inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. Using discrete speech representations, we construct SpeechInstruct, the first large-scale cross-modal speech instruction dataset. We then employ a three-stage training strategy consisting of modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow cross-modal human instructions and highlight the potential of handling multiple modalities with one model. Code and models are available at https://github.com/0nutation/SpeechGPT. Demos are shown at https://0nutation.github.io/SpeechGPT.github.io/.
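The abstract does not spell out how discrete speech representations enter the language model, so the following is only a minimal sketch of one common approach: speech is assumed to be already quantized into integer unit IDs, consecutive repeats are collapsed, and each unit becomes a new token appended to the text vocabulary. The function names, the `<sosp>`/`<eosp>` boundary markers, and the de-duplication step are illustrative assumptions, not details confirmed by the abstract.

```python
from itertools import groupby

# Hypothetical boundary markers for a speech segment (an assumption for illustration).
START, END = "<sosp>", "<eosp>"

def units_to_tokens(unit_ids):
    """Collapse consecutive repeated unit IDs and render them as token strings."""
    deduped = [u for u, _ in groupby(unit_ids)]
    return [START] + [f"<{u}>" for u in deduped] + [END]

def extend_vocab(text_vocab, num_units):
    """Return a copy of text_vocab with speech-unit tokens appended after the text tokens."""
    vocab = dict(text_vocab)
    next_id = max(vocab.values(), default=-1) + 1
    for tok in [START, END] + [f"<{i}>" for i in range(num_units)]:
        if tok not in vocab:
            vocab[tok] = next_id
            next_id += 1
    return vocab

def encode(tokens, vocab):
    """Map token strings to integer IDs in the shared text-and-speech vocabulary."""
    return [vocab[t] for t in tokens]
```

With a shared vocabulary like this, a single model can in principle read and emit both text and speech tokens in one sequence, which is the kind of setup that avoids a cascade of separate recognition and synthesis systems.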

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Resources & Methods > Large Language Models
Speech & Audio > Synthesis > Text-to-Speech
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Models > Large Language Models
Artificial Intelligence > Core AI > Multi-Modal Learning
Speech & Audio > Processing > Speech Enhancement

Keywords

speech processing, multi-modal learning, instruction following, modality alignment, multimodal large language model, speech generation, discrete speech representation, speech instruction, large language model, cross-modal instruction, speech-language model, cross-modal conversation
