2020 INTERSPEECH INTERSPEECH 2020

Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation

Abstract

Solving the cocktail party problem with the multi-modal approach has become popular in recent years. Humans can focus on the speech that they are interested in for the multi-talker mixed speech, by hearing the mixed speech, watching the speaker, and understanding the context what the speaker is talking about. In this paper, we try to solve the speaker-independent speech separation problem with all three audio-visual-contextual modalities at the first time, and those are hearing speech, watching speaker and understanding contextual language. Compared to the previous methods applying pure audio modal or audio-visual modals, a specific model is further designed to extract contextual language information for all target speakers directly from the speech mixture. Then these extracted contextual knowledge are further incorporated into the multi-modal based speech separation architecture with an appropriate attention mechanism. The experiments show that a significant performance improvement can be observed with the newly proposed audio-visual-contextual speech separation.

🧭 Keyword Pioneer — contextual language
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio