INTERSPEECH 2024

Mobile PresenTra: NICT fast neural text-to-speech system on smartphones with incremental inference of MS-FC-HiFi-GAN for low-latency synthesis

Abstract

To achieve fast, high-fidelity neural text-to-speech on edge smartphone devices without a network connection, we at NICT prototyped Mobile PresenTra, which combines a non-autoregressive acoustic model (Transformer encoder and ConvNeXt decoder) with an MS-FC-HiFi-GAN neural vocoder. Incremental inference is applied only to the neural vocoder, enabling low-latency synthesis without performance degradation. Compared with a previous NICT system that used a Transformer encoder, a Transformer decoder, and an MS-HiFi-GAN neural vocoder, Mobile PresenTra achieves high-fidelity, fast synthesis on a mid-range smartphone, with a real-time factor of about 0.3 for batch inference and a latency of less than 0.5 s for incremental inference. In the Show & Tell, attendees can try Mobile PresenTra running on actual smartphones in English, Japanese, and Chinese with arbitrary text input.
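
As a reading aid for the two metrics quoted in the abstract, the following is a minimal sketch and not the NICT implementation. It assumes the usual definition of real-time factor (processing time divided by the duration of the generated audio) and shows chunk-wise vocoding in its simplest form. The names `real_time_factor`, `incremental_vocode`, and `toy_vocoder` are hypothetical, and the toy vocoder has no receptive field, whereas a real convolutional vocoder would also need context frames carried across chunk boundaries.

```python
from typing import Callable, Iterator, List, Sequence


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Assumed definition: processing time / duration of generated audio.

    A value below 1.0 means synthesis runs faster than playback.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def toy_vocoder(frames: Sequence[float], hop: int = 4) -> List[float]:
    """Hypothetical stand-in for a neural vocoder.

    Emits `hop` samples per acoustic frame by repeating the frame value.
    """
    return [value for value in frames for _ in range(hop)]


def incremental_vocode(
    frames: Sequence[float],
    vocoder: Callable[[Sequence[float]], List[float]],
    chunk_size: int,
) -> Iterator[List[float]]:
    """Yield waveform chunks one at a time instead of one full utterance.

    Playback can begin once the first chunk is ready, so the wait before
    audio starts depends on chunk_size rather than on utterance length.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(frames), chunk_size):
        yield vocoder(frames[start:start + chunk_size])


if __name__ == "__main__":
    frames = [0.1 * i for i in range(10)]
    chunks = list(incremental_vocode(frames, toy_vocoder, chunk_size=3))
    print(len(chunks), "chunks;", len(chunks[0]), "samples in the first chunk")
    print("RTF:", real_time_factor(synthesis_seconds=0.3, audio_seconds=1.0))
```

In this sketch the acoustic features are computed for the whole utterance first and only the vocoder runs chunk by chunk, mirroring the abstract's statement that incremental inference is applied only to the neural vocoder.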
