UrHiOdSynth: A Multilingual Synthetic Corpus for Speech-to-Speech Translation in Low-Resource Indic Languages

Jamaluddin; Subhankar Panda; Aditya Narendra; Kamanksha Prasad Dubey; Mohammad Nadeem

2026 EACL EACL 2026

UrHiOdSynth: A Multilingual Synthetic Corpus for Speech-to-Speech Translation in Low-Resource Indic Languages

Abstract

AbstractSpeech-to-Speech Translation (S2ST) focuses on generating spoken output in a target language directly from spoken input in a source language. Despite progress in S2ST modeling, low-resource Indic languages remain poorly supported, primarily because large-scale parallel speech corpora are unavailable. We present UrHiOdSynth, a three-language parallel S2ST dataset containing approximately 75 hours of speech across Urdu, Hindi, and Odia. The corpus consists of 10,735 aligned sentence triplets, with an average utterance length of 8.45 seconds. To our knowledge, UrHiOdSynth represents the largest multi-domain resource offering aligned speech and text for S2ST in this language context. Beyond speech-to-speech translation, the dataset supports tasks such as automatic speech recognition, speech-to-text translation, text-to-speech synthesis, and machine translation. This flexibility enables the training of unified multilingual models, particularly for low-resource Indic languages.

🧭 Keyword Pioneer — low-resource indic language

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Jamaluddin , Subhankar Panda , Aditya Narendra , Kamanksha Prasad Dubey , Mohammad Nadeem

Topics

Speech & Audio > Synthesis > Speech Enhancement

Keywords

machine translation speech synthesis parallel corpus speech-to-speech translation low-resource indic language

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026