INTERSPEECH 2024

An Uyghur Extension to the MASSIVE Multi-lingual Spoken Language Understanding Corpus with Comprehensive Evaluations

Abstract

Spoken Language Understanding (SLU) plays a crucial role in task-oriented dialogue, and SLU has developed rapidly across many languages. However, progress in Uyghur SLU research has been slow because no dataset is publicly available. To address this, we extend the MASSIVE dataset to include the Uyghur language, creating the first Uyghur SLU dataset, MASSIVE-UG. After MASSIVE-UG is incorporated, the average overall accuracy on the other 51 languages improves, which indicates that the constructed dataset is reliable. Considering the agglutinative nature of Uyghur, we segment Uyghur words into stems and affixes and conduct experiments with different embedding methods and multiple baselines. The experimental results indicate that Uyghur SLU performance is influenced by several factors, including representation, embedding, and modeling approach. The dataset and code are available at https://github.com/xjuspeech/MASSIVE-UG.
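To make the stem and affix representation concrete, the following is a minimal sketch of greedy suffix stripping on a Latin-transliterated word. It is not the segmenter used in the paper: the affix list `TOY_AFFIXES`, the function `segment`, and the `min_stem` parameter are all hypothetical names chosen for illustration, and a real Uyghur morphological analyzer would also need to handle vowel harmony, phonological changes, and the Arabic-based script.

```python
# Toy affix inventory in Latin transliteration (an assumption for illustration,
# not the affix set or segmentation tool used by the MASSIVE-UG authors).
TOY_AFFIXES = ["lar", "ler", "ning", "ni", "da", "de"]


def segment(word, affixes=TOY_AFFIXES, min_stem=2):
    """Greedily strip the longest matching suffix until none applies.

    Returns (stem, affix_list) with affixes in their surface order.
    """
    # Try longer affixes first so that "ning" wins over "ni" when both could match.
    ordered = sorted(affixes, key=len, reverse=True)
    stem = word
    found = []
    while True:
        for affix in ordered:
            # Keep at least `min_stem` characters so the stem is never emptied.
            if stem.endswith(affix) and len(stem) - len(affix) >= min_stem:
                found.insert(0, affix)
                stem = stem[: -len(affix)]
                break
        else:
            # No affix matched on this pass, so segmentation is finished.
            break
    return stem, found


if __name__ == "__main__":
    stem, affixes = segment("kitablarda")
    print(stem, affixes)
```

Under these assumptions, a model could be fed either the whole word or the stem followed by separate affix tokens, which is the kind of representation choice the abstract says affects performance.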
