Target-Speaker Voice Activity Detection: A Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

Ivan Medennikov; Maxim Korenevsky; Tatiana Prisyach; Yuri Khokhlov; Mariya Korenevskaya; Ivan Sorokin; Tatiana Timofeeva; Anton Mitrofanov; Andrei Andrusenko; Ivan Podluzhny; Aleksandr Laptev; Aleksei Romanenko

INTERSPEECH 2020

Abstract

Speaker diarization in real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly because of their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker in each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces the activity of each speaker. The i-vectors can be estimated iteratively, starting from a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, we investigate post-processing strategies for the predicted speaker activity probabilities. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% absolute Diarization Error Rate (DER).
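
The input/output contract described in the abstract can be illustrated with a toy sketch. This is not the authors' architecture: the single hidden layer, the random untrained weights, the function names `toy_ts_vad_forward` and `binarize`, and the 0.5 threshold are all assumptions made for illustration. It only shows that frame-level features are paired with each speaker's i-vector, and that independent binary (sigmoid) outputs let several speakers be active in the same frame, which is what allows overlapping speech to be represented.

```python
import numpy as np


def toy_ts_vad_forward(features, ivectors, rng, hidden=16):
    """Toy stand-in for a TS-VAD forward pass (hypothetical, untrained).

    features: (T, F) array of frame-level speech features (e.g., MFCC).
    ivectors: (N, D) array with one i-vector per target speaker.
    Returns a (T, N) array of per-frame, per-speaker activity probabilities.
    """
    num_frames, feat_dim = features.shape
    num_speakers, ivec_dim = ivectors.shape

    # Random weights shared across speakers; a real model would be trained.
    w_in = rng.standard_normal((feat_dim + ivec_dim, hidden)) * 0.1
    w_out = rng.standard_normal(hidden) * 0.1

    probs = np.empty((num_frames, num_speakers))
    for s in range(num_speakers):
        # Pair every frame with the i-vector of target speaker s.
        tiled = np.tile(ivectors[s], (num_frames, 1))
        x = np.concatenate([features, tiled], axis=1)
        h = np.tanh(x @ w_in)
        # Independent sigmoid per speaker: a binary classification per frame.
        probs[:, s] = 1.0 / (1.0 + np.exp(-(h @ w_out)))
    return probs


def binarize(probs, threshold=0.5):
    """Simplest possible post-processing (an assumption): fixed threshold."""
    return probs > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.standard_normal((100, 40))   # 100 frames, 40-dim features
    ivecs = rng.standard_normal((4, 100))    # 4 speakers, 100-dim i-vectors
    activity = binarize(toy_ts_vad_forward(feats, ivecs, rng))
    print(activity.shape)                    # one column per speaker
```

Because each speaker has a separate binary output, a row of the resulting mask may contain more than one active speaker, unlike a clustering-based assignment that gives each segment a single label.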

Topics

Machine Learning > Core Methods > Classification

Keywords

binary classification; speaker diarization; overlapping speech; diarization error rate; multi-speaker detection; target-speaker voice activity detection
