VocalNet: Speech LLMs with Multi-Token Prediction for Faster and High-Quality Generation

Yuhao Wang; Heyang Liu; Ziyang Cheng; Ronghua Wu; Qunshan Gu; Yanfeng Wang; Yu Wang

EMNLP 2025

Abstract

Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. In this work, we introduce VocalNet, a series of high-performance speech LLMs featuring a scalable and model-agnostic training framework as well as a novel multi-token prediction (MTP) paradigm for speech generation. We first propose an efficient two-stage training framework that enables LLMs to acquire real-time speech interaction capabilities. Through extensive experiments on various training configurations, we ensure both simplicity and effectiveness in the training strategy. Furthermore, inspired by advances in language modeling, we introduce MTP into the domain of speech LLMs as an alternative to traditional next-token prediction (NTP), enabling the model to predict multiple future tokens at each step. Through systematic analysis and improved implementation, we show that MTP not only accelerates inference speed but also significantly enhances speech quality. Experimental results demonstrate that VocalNet achieves performance comparable to state-of-the-art Omni LLMs while outperforming existing open-source speech LLMs, despite using limited training data.
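To make the NTP versus MTP contrast concrete, the following is a minimal sketch of why emitting several tokens per step reduces the number of decoding steps. It is not VocalNet's implementation: `decode`, `toy_step`, and the choice of 10 tokens with 4 tokens per step are hypothetical, and `toy_step` is only a stand-in for a model forward pass that returns placeholder token ids.

```python
from typing import Callable, List, Tuple


def decode(
    step_fn: Callable[[List[int], int], List[int]],
    num_tokens: int,
    tokens_per_step: int,
) -> Tuple[List[int], int]:
    """Generate num_tokens tokens, emitting up to tokens_per_step per step.

    tokens_per_step == 1 corresponds to next-token prediction (NTP);
    tokens_per_step > 1 corresponds to multi-token prediction (MTP).
    """
    tokens: List[int] = []
    steps = 0
    while len(tokens) < num_tokens:
        new_tokens = step_fn(tokens, tokens_per_step)
        # Truncate the final step so we never exceed the requested length.
        tokens.extend(new_tokens[: num_tokens - len(tokens)])
        steps += 1
    return tokens, steps


def toy_step(prefix: List[int], k: int) -> List[int]:
    # Hypothetical stand-in for a forward pass: returns k placeholder ids
    # that simply continue counting from the current prefix length.
    start = len(prefix)
    return [start + i for i in range(k)]


ntp_tokens, ntp_steps = decode(toy_step, num_tokens=10, tokens_per_step=1)
mtp_tokens, mtp_steps = decode(toy_step, num_tokens=10, tokens_per_step=4)
print(ntp_steps, mtp_steps)
```

In this toy setting both runs produce the same sequence, but the MTP run needs fewer steps. The sketch only illustrates the step-count argument; it says nothing about how the additional tokens are predicted or about the speech-quality gains reported in the abstract.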

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Large Language Models
Machine Learning > Optimization & Theory > Optimization
Natural Language Processing > Resources & Methods > Large Language Models
Speech & Audio > Synthesis
Speech & Audio > Synthesis > Speech Synthesis

Keywords

speech synthesis; speech processing; next-token prediction; inference speed; speech generation; multi-token prediction; speech quality; two-stage training; large language model; speech llm
