INTERSPEECH 2023

Robust Feature Decoupling in Voice Conversion by Using Locality-Based Instance Normalization

Abstract

Extensive work on style transfer has shown that instance normalization (IN) is a simple yet effective way to remove style information. However, few studies have examined whether the channel-wise feature statistics used by IN, such as the mean and standard deviation (std), are consistent locally and globally; ignoring this consistency ultimately leads to insufficient feature decoupling. In this paper, we propose locality-based instance normalization (LoIN), which imposes statistical feature consistency constraints on latent feature maps. During training, LoIN performs normalization using local feature statistics computed on randomly selected frames rather than on the entire set of frames. LoIN is lightweight, computationally inexpensive, and transferable to any IN-driven voice conversion (VC) method. Experimental results show that LoIN improves disentanglement and transfer performance, with gains in both speaker similarity and content consistency.
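To make the contrast between IN and LoIN concrete, the following is a minimal sketch and not the authors' implementation. It assumes a feature map stored as a list of channels, each a list of per-frame values, and it assumes that "randomly selected frames" means a uniform random subset of frame indices drawn without replacement. The names `instance_norm`, `loin`, and `num_local_frames` are hypothetical and do not come from the paper.

```python
import random
from statistics import fmean, pstdev


def instance_norm(x, eps=1e-5):
    """Standard IN: normalize each channel with statistics over ALL frames."""
    out = []
    for channel in x:
        mean, std = fmean(channel), pstdev(channel)
        out.append([(v - mean) / (std + eps) for v in channel])
    return out


def loin(x, num_local_frames, rng, eps=1e-5):
    """Hypothetical LoIN sketch: normalize each channel with statistics
    computed on a random subset of frames (assumption: uniform sampling
    without replacement, same frame subset for every channel)."""
    num_frames = len(x[0])
    idx = rng.sample(range(num_frames), num_local_frames)
    out = []
    for channel in x:
        local = [channel[t] for t in idx]          # locally selected frames
        mean, std = fmean(local), pstdev(local)    # local statistics
        out.append([(v - mean) / (std + eps) for v in channel])
    return out, idx


# Toy feature map: 4 channels, 128 frames (synthetic values, illustration only).
rng = random.Random(0)
x = [[rng.gauss(2.0, 3.0) for _ in range(128)] for _ in range(4)]

y_global = instance_norm(x)
y_local, idx = loin(x, num_local_frames=32, rng=rng)
```

In this sketch the selected frames have zero mean and approximately unit std per channel after `loin`, while the remaining frames are normalized with the same local statistics. Whether the paper samples scattered frames or a contiguous segment, and how the subset size is chosen, is not stated in the abstract.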
