SiTa - Sinhala and Tamil Speaker Diarization Dataset in the Wild

Uthayasanker Thayasivam; Thulasithan Gnanenthiram; Shamila Jeewantha; Upeksha Jayawickrama

COLING 2025

Abstract

Speaker diarization continues to present significant challenges despite notable advances in recent years, and the rising focus on complex acoustic scenarios underscores the importance of sustained research in this area. While speech resources for speaker diarization are expanding rapidly, aided by semi-automated techniques, many existing datasets remain outdated and lack authentic real-world conversational data. This challenge is particularly acute for low-resource South Asian languages, due to limited public media data and reduced research efforts. Sinhala and Tamil are two such languages with limited speaker diarization datasets. To address this gap, we introduce a new speaker diarization dataset for these languages and evaluate multiple existing models to assess their performance. This work provides a novel dataset and insights from model benchmarks to advance speaker diarization for low-resource languages, particularly Sinhala and Tamil.

Topics

Machine Learning > Application Areas > Domain Adaptation; Speech & Audio > Recognition > Speaker Recognition; Speech & Audio > Analysis > Speaker Verification; Machine Learning > Learning Paradigms > Transfer Learning; Machine Learning > Learning Types > Supervised Learning

Keywords

speech processing; speaker recognition; speaker diarization; low-resource language; Tamil language; Sinhala language
