Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions

Anfeng Xu; Kevin Huang; Tiantian Feng; Lue Shen; Helen Tager-Flusberg; Shrikanth Narayanan

INTERSPEECH 2024

Abstract

Speech foundation models, trained on vast datasets, have opened unique opportunities for addressing challenging low-resource speech understanding tasks, such as those involving child speech. In this work, we explore the capabilities of speech foundation models for child-adult speaker diarization. We show that exemplary foundation models can achieve 39.5% and 62.3% relative reductions in Diarization Error Rate and Speaker Confusion Rate, respectively, compared to previous speaker diarization methods. In addition, we benchmark and evaluate the speaker diarization results of the speech foundation models while varying the input audio window size, speaker demographics, and training data ratio. Our results highlight promising pathways for understanding and adopting speech foundation models to facilitate child speech understanding.
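To make the reported quantities concrete, the following minimal sketch shows how a Diarization Error Rate and a relative reduction are conventionally computed. It assumes the standard definition of DER as the sum of false alarm, missed speech, and speaker confusion time divided by total reference speech time. The function names and all durations are hypothetical and chosen only for illustration; they are not the paper's evaluation code or results.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Standard DER: (false alarm + missed + confusion) / total reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


def speaker_confusion_rate(confusion, total_speech):
    """Share of reference speech time attributed to the wrong speaker."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return confusion / total_speech


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers `baseline` (0.4 means 40% lower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical durations (seconds) for a baseline and an improved system,
# both scored against the same 100 s of reference speech.
TOTAL = 100.0
baseline_der = diarization_error_rate(2.0, 3.0, 5.0, TOTAL)
improved_der = diarization_error_rate(2.0, 3.0, 1.0, TOTAL)
baseline_conf = speaker_confusion_rate(5.0, TOTAL)
improved_conf = speaker_confusion_rate(1.0, TOTAL)

print(f"DER: {baseline_der:.3f} -> {improved_der:.3f}, "
      f"relative reduction {relative_reduction(baseline_der, improved_der):.1%}")
print(f"Confusion: {baseline_conf:.3f} -> {improved_conf:.3f}, "
      f"relative reduction {relative_reduction(baseline_conf, improved_conf):.1%}")
```

In this toy case only the confusion term changes, so the relative reduction in confusion rate is larger than the relative reduction in DER, which also includes the unchanged false alarm and missed speech terms.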

Topics

Artificial Intelligence > Core AI > Foundation Models; Artificial Intelligence > Core AI > Multimodal Learning

Keywords

speaker diarization; speech foundation model; child speech recognition; diarization error rate; speaker confusion rate
