INTERSPEECH 2024

Context-Aware Speech Recognition Using Prompts for Language Learners

Abstract

We aim to enhance automatic speech recognition (ASR) systems with context-aware prompts, improving accuracy without complex domain-specific language models or fine-tuning. This is particularly valuable for spoken language learning, where instruction and assessment apps often present short spoken texts to elicit spoken responses. These elicitors narrow the range of expected, sensible spoken responses. Prompting ASR engines (Whisper and Gemini Audio) with an utterance's elicitor makes them context-aware and significantly improves performance. On two L2 English datasets, using elicitor texts as prompts reduced the word error rate (WER) of Whisper and Gemini by up to 24.0% relative. For one activity type, the elicitor text halves the errors in target words. Out-of-domain, prompt-enhanced Gemini outperformed a conventional ASR system trained on in-domain data by 35.3% (relative WER); prompt-enhanced Whisper outperformed it by 21.3%.
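The figures above are relative WER changes, which can be made concrete with a minimal sketch. It assumes the standard word-level edit-distance definition of WER and a plain whitespace tokenizer; the abstract does not state the text normalization or scoring tool actually used. The function names and the toy sentences and numbers are illustrative only and are not taken from the paper's data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, prompted_wer: float) -> float:
    """Fraction of the baseline WER removed by prompting."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - prompted_wer) / baseline_wer


# Toy illustration (invented sentences, not from the paper's datasets):
toy_wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")

# Invented WER values chosen only to show the arithmetic:
toy_reduction = relative_wer_reduction(baseline_wer=0.25, prompted_wer=0.19)
```

Under this definition, a relative reduction of 0.24 means the prompted system makes 24% fewer word errors than the unprompted baseline, not that its absolute WER fell by 24 percentage points.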

Authors