INTERSPEECH 2024

Disentangling prosody and timbre embeddings via voice conversion

Abstract

Modern voice conversion and anonymization architectures generally share a design that preserves the linguistic content and expressivity of the source while modifying speaker timbre. As a result, the converted signal is almost perfectly synchronized with the source signal. In this paper, we hypothesize that this paradigm can help quantify the share of speaker identity preserved in the converted voice, referred to here as prosody (including speech melody and rhythm). Based on this hypothesis, we propose a method to split and disentangle the speaker representation into complementary embeddings that convey prosodic and timbre information, respectively. Additionally, we propose a method to evaluate prosody preservation in standard voice privacy architectures, and we validate the ability of the prosodic and timbre embeddings to detect related voice attributes.
