Mobile PresenTra: NICT fast neural text-to-speech system on smartphones with incremental inference of MS-FC-HiFi-GAN for low-latency synthesis

Takuma Okamoto; Yamato Ohtani; Hisashi Kawai

INTERSPEECH 2024

Abstract

To achieve fast, high-fidelity neural text-to-speech on edge smartphone devices without a network connection, we at NICT prototyped Mobile PresenTra, which combines a non-autoregressive acoustic model (Transformer encoder and ConvNeXt decoder) with the MS-FC-HiFi-GAN neural vocoder. Incremental inference is applied only to the neural vocoder, giving low-latency synthesis without performance degradation. Compared with a previous NICT system that used a Transformer encoder, a Transformer decoder, and the MS-HiFi-GAN neural vocoder, Mobile PresenTra achieves high-fidelity, fast synthesis on a mid-range smartphone, with a real-time factor of about 0.3 for batch inference and a latency of less than 0.5 s for incremental inference. In the Show & Tell session, attendees can freely try Mobile PresenTra on actual smartphones for English, Japanese, and Chinese with arbitrary text input.
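The two quantities quoted in the abstract can be made concrete with a small sketch. The real-time factor (RTF) is the synthesis time divided by the duration of the generated audio, so an RTF of about 0.3 means roughly 3 s of computation for 10 s of speech. The second part is a toy illustration of why chunk-wise (incremental) processing of a convolutional layer need not change the output: if each chunk is given the neighbouring context that the kernel needs, the chunked result matches the batch result. This is not the MS-FC-HiFi-GAN implementation, and the abstract does not describe how the authors handle chunk boundaries; the function names, the kernel, and the chunk size are all hypothetical.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = processing time / duration of generated audio (below 1 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def conv1d_same(signal, kernel):
    """Batch inference: zero-padded 1-D convolution over the whole input (odd kernel length)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def conv1d_incremental(signal, kernel, chunk):
    """Incremental inference: yield the output chunk by chunk.

    Each chunk borrows `half` samples of context on both sides (zeros at the
    ends of the signal), so the concatenated chunks equal the batch output.
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    half = len(kernel) // 2
    n = len(signal)
    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        lo = max(0, start - half)   # left context actually available
        hi = min(n, end + half)     # right context actually available
        left_pad = half - (start - lo)
        right_pad = half - (hi - end)
        padded = [0.0] * left_pad + list(signal[lo:hi]) + [0.0] * right_pad
        yield [
            sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(end - start)
        ]


if __name__ == "__main__":
    # Hypothetical numbers chosen to match the "about 0.3" figure in the abstract.
    print(real_time_factor(3.0, 10.0))

    signal = [float(i % 7) for i in range(50)]   # stand-in for vocoder input frames
    kernel = [0.25, 0.5, 0.25]                   # illustrative smoothing kernel
    batch = conv1d_same(signal, kernel)
    streamed = [y for part in conv1d_incremental(signal, kernel, 8) for y in part]
    print(streamed == batch)
```

In this sketch the first chunk can be emitted as soon as its own samples and a small amount of look-ahead are available, which is the general reason incremental inference lowers the time to first audio compared with waiting for the whole utterance.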

Topics

Machine Learning > Application Areas > Efficient Computing
Deep Learning > Techniques > Model Architecture
Speech & Audio > Synthesis > Text-to-Speech
Deep Learning > Models > Neural Networks

Keywords

edge deployment, mobile deployment, text-to-speech synthesis, neural vocoder, incremental inference, low-latency synthesis, real-time factor
