Teacher-Student MixIT for Unsupervised and Semi-Supervised Speech Separation

Jisi Zhang; Cătălin Zorilă; Rama Doddipatla; Jon Barker

INTERSPEECH 2021


Abstract

In this paper, we introduce a novel semi-supervised learning framework for end-to-end speech separation. The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model. The teacher model then estimates separated sources that are used to train a student model with standard permutation invariant training (PIT). The student model can be fine-tuned with supervised data, i.e., paired artificial mixtures and clean speech sources, and further improved via model distillation. Experiments with single- and multi-channel mixtures show that the teacher-student training resolves the over-separation problem observed in the original MixIT method. Further, the semi-supervised performance is comparable to that of a fully supervised separation system trained using ten times the amount of supervised data.
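The two training criteria named in the abstract can be illustrated with a minimal, brute-force sketch. This is not the authors' implementation: the function names (`mse`, `mixit_loss`, `pit_loss`), the use of mean squared error as the signal-level loss, and the representation of signals as plain Python lists are all assumptions made for illustration. MixIT searches over assignments of estimated sources to the input mixtures, while PIT searches over orderings of the estimated sources against reference (or, in the teacher-student setting, teacher-estimated) sources.

```python
import itertools


def mse(a, b):
    """Mean squared error between two equal-length signals (assumed loss)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mixit_loss(mixtures, est_sources):
    """Brute-force MixIT-style loss (illustrative sketch).

    Each estimated source is assigned to exactly one input mixture; the
    sources assigned to a mixture are summed and compared with it. The
    assignment with the lowest average error is returned.
    """
    n_mix, n_src = len(mixtures), len(est_sources)
    n_samples = len(mixtures[0])
    best_loss, best_assign = float("inf"), None
    for assign in itertools.product(range(n_mix), repeat=n_src):
        # Remix the estimated sources according to this assignment.
        remixes = [[0.0] * n_samples for _ in range(n_mix)]
        for src, mix_idx in zip(est_sources, assign):
            remixes[mix_idx] = [r + s for r, s in zip(remixes[mix_idx], src)]
        loss = sum(mse(m, r) for m, r in zip(mixtures, remixes)) / n_mix
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


def pit_loss(targets, est_sources):
    """Brute-force PIT-style loss (illustrative sketch).

    Tries every ordering of the estimated sources and keeps the one that
    best matches the targets; perm[i] is the estimate paired with target i.
    """
    n = len(targets)
    best_loss, best_perm = float("inf"), None
    for perm in itertools.permutations(range(n)):
        loss = sum(mse(targets[i], est_sources[perm[i]]) for i in range(n)) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

In a teacher-student setup of the kind the abstract describes, the `targets` passed to `pit_loss` would be sources estimated by the MixIT-trained teacher rather than clean references. Both searches are exhaustive, so this sketch is only practical for small numbers of sources.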


Authors

Jisi Zhang, Cătălin Zorilă, Rama Doddipatla, Jon Barker

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Learning Types > Semi-Supervised Learning

Keywords

semi-supervised learning, speech separation, model distillation, teacher-student training, permutation invariant training, mixture invariant training
