INTERSPEECH 2023

Speaker Diarization for ASR Output with T-vectors: A Sequence Classification Approach

Abstract

This paper considers applying speaker diarization (SD) to the output tokens of automatic speech recognition (ASR). We formulate the task as a sequence classification problem: given a sequence of token-level speaker embeddings and a set of candidate speaker profiles, we estimate the correct speaker label for each ASR output token. To leverage information from the ASR model, we use the recently proposed t-vector for speaker embedding estimation. Whereas previous studies classified t-vectors using cosine similarities with ad hoc post-processing, we propose a sequence classification model that exploits the sequential nature of the task more effectively. To handle a variable number of speakers, we use a classification model inspired by transformer-based target-speaker voice activity detection. We conduct experiments on the AMI meeting corpus in both speaker identification and diarization settings and show the effectiveness of our approach.
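For orientation, the cosine-similarity baseline that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the prior approach, not the proposed sequence classification model: each token is labeled independently by its most similar speaker profile, with no post-processing. The function names, the speaker names, and the two-dimensional toy vectors are all assumptions made for the example; real t-vectors would come from the ASR model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_speakers(token_embeddings, speaker_profiles):
    """Label each token with the profile that has the highest cosine similarity.

    Hypothetical helper: every token is classified on its own, so no
    information from neighboring tokens is used.
    """
    labels = []
    for emb in token_embeddings:
        scores = {name: cosine(emb, prof) for name, prof in speaker_profiles.items()}
        labels.append(max(scores, key=scores.get))
    return labels


# Toy data (assumed): two candidate speaker profiles and three token embeddings.
profiles = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}
tokens = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
labels = assign_speakers(tokens, profiles)
print(labels)
```

Because each decision here ignores the neighboring tokens, a single noisy embedding can flip a label in the middle of an utterance, which is the kind of error that a model operating on the whole sequence is meant to reduce.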
