Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation

Hanzhao Li; Liumeng Xue; Haohan Guo; Xinfa Zhu; Yuanjun Lv; Lei Xie; Yunlin Chen; Hao Yin; Zhifei Li

INTERSPEECH 2024

Abstract

Multi-codebook speech codecs enable the application of large language models (LLMs) to TTS, but predicting multiple token sequences limits efficiency and robustness. To avoid this obstacle, we propose Single-Codec, a single-codebook, single-sequence codec that employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically rich discrete sequence. Furthermore, the encoder is enhanced with 1) contextual modeling with a BLSTM module to exploit temporal information, 2) a hybrid sampling module to alleviate distortion from upsampling and downsampling, and 3) a resampling module to encourage discrete units to carry more phonetic information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec, Single-Codec demonstrates higher reconstruction quality at a lower bandwidth of only 304 bps. The effectiveness of Single-Codec is further validated by LLM-TTS experiments, which show improved naturalness and intelligibility.
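As a rough illustration of how a codec bandwidth figure such as the one in the abstract relates to codebook design, the sketch below computes bitrate as frame rate times bits per index times number of codebooks. The abstract states only the 304 bps figure; the codebook size, the implied frame rate, and the eight-codebook comparison are hypothetical values chosen for illustration, not settings reported here.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Bitrate of a codec that emits `num_codebooks` indices per frame."""
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * bits_per_index * num_codebooks


# ASSUMPTION: a hypothetical codebook size, used only to show the arithmetic.
ASSUMED_CODEBOOK_SIZE = 8192  # 2**13, i.e. 13 bits per index

# Frame rate that a single codebook of this size would need to reach 304 bps.
implied_frame_rate_hz = 304 / math.log2(ASSUMED_CODEBOOK_SIZE)

single = codec_bitrate_bps(implied_frame_rate_hz, ASSUMED_CODEBOOK_SIZE, num_codebooks=1)

# ASSUMPTION: a hypothetical 8-codebook codec at the same frame rate and size.
multi = codec_bitrate_bps(implied_frame_rate_hz, ASSUMED_CODEBOOK_SIZE, num_codebooks=8)

print(f"single codebook: {single:.0f} bps, eight codebooks: {multi:.0f} bps")
```

Under these assumed values, the bitrate and the number of tokens an LLM must predict per frame both scale linearly with the number of codebooks, which is the cost a single-sequence codec avoids.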

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Keywords

vector quantization; speech generation; neural codec; speech reconstruction; speech codec; discrete latent representation
