Learning Better Speech Representations by Worsening Interference

Jun Wang

2020 INTERSPEECH INTERSPEECH 2020

Learning Better Speech Representations by Worsening Interference

Abstract

Can better representations be learnt from worse interfering scenarios? To verify this seeming paradox, we propose a novel framework that performed compositional learning in traditionally independent tasks of speech separation and speaker identification. In this framework, generic pre-training and compositional fine-tuning are proposed to mimic the bottom-up and top-down processes of a human’s cocktail party effect. Moreover, we investigate schemes to prohibit the model from ending up learning an easier identity-prediction task. Substantially discriminative and generalizable representations can be learnt in severely interfering conditions. Experiment results on downstream tasks show that our learnt representations have superior discriminative power than a standard speaker verification method. Meanwhile, RISE achieves higher SI-SNRi consistently in different inference modes over DPRNN, a state-of-the-art speech separation system.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🐣 Hot Topic Early Bird — self-supervised learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

Authors

Jun Wang

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Machine Learning > Learning Types > Self-Supervised Learning Speech & Audio > Analysis > Speaker Verification Machine Learning > Learning Types > Multi-Task Learning Machine Learning > Learning Types > Representation Learning

Keywords

representation learning speech separation self-supervised learning multimodal learning speaker verification speaker identification cocktail party effect

Download PDF

Related papers

Memory Controlled Sequential Self Attention for Sound Recognition 2020

Dual Attention in Time and Frequency Domain for Voice Activity Detection 2020

Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer 2020

A Noise Robust Technique for Detecting Vowels in Speech Signals 2020

Joint Detection of Sentence Stress and Phrase Boundary for Prosody 2020