EEND-M2F: Masked-attention mask transformers for speaker diarization

Marc Harkonen; Samuel J. Broughton; Lahiru Samarakoon

INTERSPEECH 2024

Abstract

In this paper, we make explicit the connection between image segmentation methods and end-to-end diarization methods. From these insights, we propose a novel, fully end-to-end diarization model, EEND-M2F, based on the Mask2Former architecture. Speaker representations are computed in parallel using a stack of transformer decoders, in which irrelevant frames are explicitly masked from the cross-attention using predictions from previous layers. EEND-M2F is efficient and truly end-to-end, eliminating the need for additional segmentation models or clustering algorithms. Our model achieves state-of-the-art performance on several public datasets, including AMI, AliMeeting and RAMC. Most notably, our DER of 16.07% on DIHARD-III is the first major improvement upon the challenge-winning system.
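The masked cross-attention step mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the 0.5 threshold, the dot-product mask head, and the fallback for queries whose mask is empty are all assumptions made for illustration, loosely following how Mask2Former-style decoders are usually described. Each speaker query attends only to the frames that the previous layer predicted as active for that speaker.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries receive zero weight.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def predict_mask_logits(queries, frames):
    # Hypothetical mask head: one activity logit per (speaker query, frame).
    return queries @ frames.T

def masked_cross_attention(queries, frames, prev_mask_logits, threshold=0.5):
    """Single-head cross-attention restricted by the previous layer's masks.

    queries:          (S, D) speaker queries
    frames:           (T, D) acoustic frame embeddings
    prev_mask_logits: (S, T) activity logits from the previous layer
    """
    d = queries.shape[1]
    scores = queries @ frames.T / np.sqrt(d)               # (S, T)

    # Keep only frames the previous layer considered active for each query.
    prev_probs = 1.0 / (1.0 + np.exp(-prev_mask_logits))
    keep = prev_probs >= threshold

    # Assumption: a query with no active frames attends to all frames,
    # which avoids a softmax over an empty set.
    keep[~keep.any(axis=1)] = True

    scores = np.where(keep, scores, -np.inf)
    weights = softmax(scores, axis=-1)                     # (S, T)
    return weights @ frames, weights

# Toy example: 2 speaker queries, 6 frames, 4-dimensional embeddings.
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 4))
queries = rng.normal(size=(2, 4))
prev = np.array([[5.0, 5.0, -5.0, -5.0, -5.0, -5.0],   # speaker 0: frames 0-1 active
                 [-5.0] * 6])                           # speaker 1: nothing active yet
updated, attn = masked_cross_attention(queries, frames, prev)
next_logits = predict_mask_logits(updated, frames)      # masks for the next layer
```

In the toy example, the first query places all of its attention weight on frames 0 and 1, while the second query falls back to attending over every frame. A full decoder layer would add learned projections, multiple heads, self-attention among queries, and a feed-forward block, all omitted here.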

Authors

Marc Harkonen, Samuel J. Broughton, Lahiru Samarakoon

Topics

Artificial Intelligence > Core AI > Multi-Agent Systems
Deep Learning > Architectures > Transformers
Computer Vision > Analysis > Object Detection
Speech & Audio > Analysis > Speaker Verification
Deep Learning > Learning Types > Self-Supervised Learning

Keywords

speaker embedding, speaker diarization, transformer decoder, end-to-end model, speaker representation, end-to-end diarization, masked-attention transformer, mask attention
