Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments

Yiyu Luo; Jing Wang; Liang Xu; Lidong Yang

2021 INTERSPEECH INTERSPEECH 2021

Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments

Abstract

Speech separation is the task of extracting target speech from noisy mixture. In applications like video telephones or video conferencing, lip movements of the target speaker are accessible, which can be leveraged for speech separation. This paper proposes a time-domain audio-visual speech separation model under multi-talker environments. The model receives audio-visual inputs including noisy mixture and speaker lip embedding, and reconstructs clean speech waveform for the target speaker. Once trained, the model can be flexibly applied to unknown number of total speakers. This paper introduces and investigates the multi-stream gating mechanism and pyramidal convolution in temporal convolutional neural networks for audio-visual speech separation task. Speaker- and noise-independent multi-talker separation experiments are conducted on GRID benchmark dataset. The experimental results demonstrate the proposed method achieves 3.9 dB and 1.0 dB SI-SNRi improvement when compared with audio-only and audio-visual baselines respectively, showing effectiveness of the proposed method.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Speech & Audio

🧭 Keyword Pioneer — multi-talker environment

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Yiyu Luo , Jing Wang , Liang Xu , Lidong Yang

Topics

Deep Learning > Architectures > Neural Networks Deep Learning > Techniques > Model Architecture Computer Vision > Processing > Video Processing Speech & Audio > Analysis > Speech Enhancement Deep Learning > Learning Types > Multi-Modal Learning

Keywords

speech separation speaker embedding lip movement speaker separation temporal convolutional network multi-talker environment audio-visual processing multi-talker separation audio-visual speech separation

Download PDF

Related papers

Energy-Friendly Keyword Spotting System Using Add-Based Convolution 2021

Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information 2021

Using Games to Augment Corpora for Language Recognition and Confusability 2021

A Psychology-Driven Computational Analysis of Political Interviews 2021

The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results 2021