RoDia: A New Dataset for Romanian Dialect Identification from Speech

Codruț Rotaru; Nicolae Cӑtӑlin Ristea; Radu Tudor Ionescu

2024 NAACL NAACL 2024

RoDia: A New Dataset for Romanian Dialect Identification from Speech

Abstract

AbstractWe introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.

🌉 Interdisciplinary Bridge — Machine Learning and Speech & Audio

🧭 Keyword Pioneer — roman dialect

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Codruț Rotaru , Nicolae Cӑtӑlin Ristea , Radu Tudor Ionescu

Topics

Machine Learning > Core Methods > Classification Speech & Audio > Recognition > Speech Recognition

Keywords

automatic speech recognition speech classification low-resource language dialect identification roman dialect

Download PDF

Related papers

Working Alliance Transformer for Psychotherapy Dialogue Classification 2024

Named Entity Recognition Under Domain Shift via Metric Learning for Life Sciences 2024

Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study 2024

TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation 2024

Extractive Summarization with Text Generator 2024