Disentangling prosody and timbre embeddings via voice conversion

Nicolas Gengembre; Olivier Le Blouch; Cedric Gendrot

INTERSPEECH 2024

Abstract

Modern voice conversion and anonymization architectures generally share a design that preserves the linguistic content and expressivity of the source while modifying the speaker's timbre. As a result, the converted signal is almost perfectly synchronized with the source signal. In this paper, we hypothesize that this paradigm can help quantify the share of speaker identity that is preserved in the converted voice, referred to here as prosody (including speech melody and rhythm). Based on this observation, we propose a method to split and disentangle the speaker representation into complementary embeddings that convey prosodic and timbre information, respectively. Additionally, we propose a method to evaluate prosody preservation in standard voice privacy architectures, and we validate the ability of the prosodic and timbre embeddings to detect related voice attributes.
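As a rough illustration of what "quantifying the speaker identity preserved in the converted voice" could look like in practice, the sketch below compares fixed-size speaker embeddings of source utterances with embeddings of their converted counterparts using cosine similarity. This is not the authors' method: the abstract does not specify the metric or the encoder, so the similarity measure, the function names `cosine_similarity` and `mean_pairwise_similarity`, and the assumption that embeddings arrive as plain lists of floats from some speaker encoder (not shown) are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(source_embs, converted_embs):
    """Average similarity over (source, converted) utterance pairs.

    Assumption: source_embs[i] and converted_embs[i] are embeddings of the
    same utterance before and after voice conversion.
    """
    if len(source_embs) != len(converted_embs) or not source_embs:
        raise ValueError("need the same non-zero number of embeddings")
    sims = [cosine_similarity(s, c) for s, c in zip(source_embs, converted_embs)]
    return sum(sims) / len(sims)


# Toy vectors standing in for real embeddings (illustrative values only).
source = [[1.0, 0.0], [1.0, 0.0]]
converted = [[1.0, 0.0], [0.0, 1.0]]
score = mean_pairwise_similarity(source, converted)
```

Under these assumptions, a score near 1 would suggest that the representation is largely unchanged by conversion (as one would expect of a prosody-like embedding), while a lower score would suggest that conversion altered it (as one would expect of a timbre-like embedding).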

Topics

Machine Learning > Core Methods > Representation Learning
Speech & Audio > Analysis > Prosody Analysis

Keywords

voice conversion, prosody analysis, speaker identity, speaker representation, timbre embedding, embedding disentanglement
