STEN-TTS: Improving Zero-shot Cross-Lingual Transfer for Multi-Lingual TTS with Style-Enhanced Normalization Diffusion Framework

Chung Tran; Chi Mai Luong; Sakriani Sakti

INTERSPEECH 2023


Abstract

Personalized multilingual tools play an important role in learning aids and virtual assistants. Existing work on multilingual adaptive text-to-speech (TTS) mainly focuses on fine-tuning models or extracting personal styles, such as prosody, emotion, and identity, in order to adapt to new speakers. This paper introduces the Style-Enhanced Normalization TTS (STEN-TTS) approach, which synthesizes multilingual speech while preserving personal style from only 3 seconds of reference audio. By integrating a style-enhanced normalization module (STEN) into the diffusion model, the proposed method can simulate the speaker's style and eliminate white noise in the synthesized speech. Experimental results show that the model achieves an SMOS above 3.5 for cross-lingual switching. Furthermore, when speaker verification is used to assess the similarity between ground-truth and synthesized voices, accuracy reaches 82.4% with 3 seconds of reference audio.
