What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study

Xiaoran Fan; Zhichao Sun; Yangfan Gao; Jingfei Xiong; Hang Yan; Yifei Cao; Jiajun Sun; Shuo Li; Zhihao Zhang; Zhiheng Xi; Yuhao Zhou; Senjie Jin; Changhao Jiang; Junjie Ye; Ming Zhang; Rui Zheng; Zhenhua Han; Yunke Zhang; Demei Yan; Shaokang Dong; Tao Ji; Tao Gui

AAAI 2026

Abstract

Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. However, challenges remain in achieving effective cross-modal alignment and high-quality speech generation. In this work, we systematically investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We compare coupled, semi-decoupled, and fully decoupled speech tokenizers under a fair SLM framework and find that decoupled tokenization significantly improves alignment and synthesis quality. To address the information density mismatch between speech and text, we introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens. This leads to up to 12× faster decoding and a substantial drop in word error rate (from 6.07 to 3.01). Furthermore, we propose a speaker-aware generation paradigm and introduce RoleTriviaQA, a large-scale role-playing knowledge QA benchmark with diverse speaker identities. Experiments demonstrate that our methods enhance both knowledge understanding and speaker consistency.
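To make the multi-token prediction idea concrete, the following is a minimal sketch, assuming one independent linear head per predicted position and greedy selection. The abstract does not describe the actual head design, so the names `make_head`, `head_argmax`, `mtp_decode_step`, and `decoding_steps`, along with all sizes, are hypothetical and chosen only for illustration.

```python
import math
import random

def make_head(hidden_dim, vocab_size, rng):
    """One hypothetical linear head: a vocab_size x hidden_dim weight matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]
            for _ in range(vocab_size)]

def head_argmax(head, hidden):
    """Greedy pick: index of the largest logit produced by this head."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in head]
    return max(range(len(logits)), key=logits.__getitem__)

def mtp_decode_step(heads, hidden):
    """Decode len(heads) speech tokens from a single hidden state."""
    return [head_argmax(head, hidden) for head in heads]

def decoding_steps(num_speech_tokens, tokens_per_step):
    """Number of LLM forward steps needed to emit all speech tokens."""
    return math.ceil(num_speech_tokens / tokens_per_step)

# Illustrative sizes only; these are assumptions, not values from the paper.
rng = random.Random(0)
HIDDEN_DIM, VOCAB_SIZE, K = 8, 16, 4
heads = [make_head(HIDDEN_DIM, VOCAB_SIZE, rng) for _ in range(K)]
hidden = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_DIM)]

# One hidden state yields K speech tokens instead of one.
tokens = mtp_decode_step(heads, hidden)
```

Under this sketch, emitting K tokens per hidden state cuts the number of forward steps by a factor of K: for example, `decoding_steps(120, 12)` needs 10 steps where `decoding_steps(120, 1)` needs 120. Whether the reported speedup arises in exactly this way is not stated in the abstract.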

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Natural Language Processing > Generation > Language Modeling

Keywords

cross-modal alignment, speech language model, speech generation, multi-token prediction, speaker modeling, speech tokenizer
