Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

Kun Zhou; Shengkui Zhao; Yukun Ma; Chong Zhang; Hao Wang; Dianwen Ng; Chongjia Ni; Trung Hieu Nguyen; Jia Qi Yip; Bin Ma

INTERSPEECH 2024

Abstract

Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model. Subsequently, a non-autoregressive model is employed to predict discrete acoustic codecs that contain fine-grained acoustic details. The TTS model focuses solely on linguistic modeling during autoregressive training, thereby reducing the error propagation that occurs in non-autoregressive training. Both objective and subjective evaluations validate the effectiveness of our proposed method.
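The two-stage design described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `EOS` id, the toy predictors, the number of codebooks, and the one-codec-frame-per-phonetic-token alignment are all assumptions made for illustration. The sketch only shows the control flow: an autoregressive loop produces phonetically rich tokens one step at a time, and a non-autoregressive step then maps the complete token sequence to acoustic codec codes frame by frame, with no dependence between output frames.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence id for the phonetic token vocabulary


def generate_phonetic_tokens(
    text_ids: List[int],
    next_token: Callable[[List[int], List[int]], int],
    max_len: int = 50,
) -> List[int]:
    """Stage 1 (autoregressive): emit one phonetic token at a time,
    each conditioned on the text and on all previously emitted tokens."""
    tokens: List[int] = []
    for _ in range(max_len):
        tok = next_token(text_ids, tokens)
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens


def predict_acoustic_codes(
    phonetic_tokens: List[int],
    frame_to_codes: Callable[[List[int], int], List[int]],
) -> List[List[int]]:
    """Stage 2 (non-autoregressive): each frame's codec codes are computed
    from the full phonetic sequence, independently of other output frames."""
    return [frame_to_codes(phonetic_tokens, i) for i in range(len(phonetic_tokens))]


# Toy stand-ins for the two trained models (placeholders, not real predictors).
def toy_next_token(text_ids: List[int], prefix: List[int]) -> int:
    # Emit two phonetic frames per text token, then signal the end.
    pos = len(prefix)
    if pos >= 2 * len(text_ids):
        return EOS
    return text_ids[pos // 2] + 100


def toy_frame_to_codes(
    phonetic_tokens: List[int], i: int, num_codebooks: int = 4
) -> List[int]:
    # One code per (assumed) codebook for frame i.
    return [(phonetic_tokens[i] * (q + 1)) % 1024 for q in range(num_codebooks)]


def synthesize_codes(text_ids: List[int]) -> List[List[int]]:
    """Run both stages; a codec decoder (not shown) would turn codes into audio."""
    phonetic = generate_phonetic_tokens(text_ids, toy_next_token)
    return predict_acoustic_codes(phonetic, toy_frame_to_codes)
```

In this sketch, only the first stage feeds its own predictions back as input, so that is the only place where a wrong prediction can compound across time steps. Restricting that stage to phonetic targets is the point the abstract makes about focusing autoregressive training on linguistic modeling.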

Topics

Natural Language Processing > Generation > Language Modeling; Natural Language Processing > Generation > Text Generation

Keywords

language modeling; self-supervised representation; text-to-speech synthesis; phonetic representation; autoregressive language model
