Multi-latency look-ahead for streaming speaker segmentation

Bilal Rahou; Hervé Bredin

INTERSPEECH 2024

Abstract

We address the task of streaming speaker diarization and propose several contributions to achieve a better trade-off between latency and accuracy. First, computational latency is reduced to its bare minimum by switching to a causal frame-wise speaker segmentation architecture. Then, a multi-latency look-ahead mechanism is used during training to support adaptive latency during inference at no additional computational cost. Finally, we detail the method used during inference to achieve the final frame-wise segmentation. We evaluate the impact of these contributions on the AMI meeting dataset with a focus on the speaker segmentation step, seen through the prism of voice activity detection, overlapped speech detection and speaker change detection.
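The three views named in the abstract (voice activity detection, overlapped speech detection and speaker change detection) can all be read off a frame-wise speaker segmentation. The sketch below is illustrative only and is not taken from the paper: it assumes the segmentation is a binary matrix with one row per frame and one column per speaker, and it assumes a speaker change is any frame whose set of active speakers differs from that of the previous frame. The function name `derive_tasks` is hypothetical.

```python
from typing import List, Sequence, Tuple


def derive_tasks(
    activity: Sequence[Sequence[int]],
) -> Tuple[List[int], List[int], List[int]]:
    """Derive frame-wise VAD, OSD and SCD labels from speaker activity.

    `activity[t][s]` is 1 if speaker `s` is active at frame `t`, else 0
    (assumed input format, not specified by the source).
    """
    # Voice activity: at least one speaker is active in the frame.
    vad = [int(sum(frame) >= 1) for frame in activity]

    # Overlapped speech: at least two speakers are active in the frame.
    osd = [int(sum(frame) >= 2) for frame in activity]

    # Speaker change (assumed definition): the active-speaker pattern
    # differs from the previous frame. The first frame has no predecessor.
    scd = [0] * len(activity)
    for t in range(1, len(activity)):
        scd[t] = int(tuple(activity[t]) != tuple(activity[t - 1]))

    return vad, osd, scd


if __name__ == "__main__":
    # Two speakers over five frames: silence, A, A+B, B, B.
    example = [[0, 0], [1, 0], [1, 1], [0, 1], [0, 1]]
    vad, osd, scd = derive_tasks(example)
    print("VAD:", vad)
    print("OSD:", osd)
    print("SCD:", scd)
```

Evaluating each derived label sequence against its reference counterpart gives three complementary views of the same segmentation output, which is one way to interpret the "prism" mentioned in the abstract.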

Topics

Machine Learning > Core Methods > Classification
Data Science & Analytics > Applications > Clustering

Keywords

speaker diarization, voice activity detection, speaker change detection, speaker segmentation, overlapped speech detection, multi-latency look-ahead
