Context-Aware Speech Recognition Using Prompts for Language Learners

Jian Cheng

INTERSPEECH 2024

Abstract

We aim to enhance automatic speech recognition (ASR) systems with context-aware prompts, improving accuracy without complex domain-specific language models or fine-tuning. This is particularly valuable for spoken language learning, where instruction and assessment apps often present short spoken texts to elicit spoken responses. These elicitors narrow the range of expected, sensible spoken responses. Prompting ASR engines (Whisper and Gemini Audio) with an utterance's elicitor makes them context-aware and significantly improves performance. On two L2 English datasets, using elicitor texts as prompts reduced the word error rate (WER) of Whisper and Gemini by up to 24.0% relative. For one activity type, the elicitor text halved the errors in target words. Out-of-domain, prompt-enhanced Gemini outperformed a conventional ASR system trained on in-domain data by 35.3% (relative WER); prompt-enhanced Whisper outperformed it by 21.3%.
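The relative WER figures in the abstract compare an error rate with and without the elicitor prompt. A minimal sketch of that arithmetic follows, assuming a plain word-level edit distance with lowercasing and whitespace tokenization. The function names and the example transcripts are hypothetical, and the paper's actual text normalization and scoring setup are not given on this page.

```python
from typing import Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    pairs = list(pairs)
    total_words = sum(len(ref.split()) for ref, _ in pairs)
    if total_words == 0:
        raise ValueError("reference transcripts contain no words")
    total_errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    return total_errors / total_words


def relative_wer_reduction(baseline_wer: float, prompted_wer: float) -> float:
    """Fraction of the baseline WER removed by prompting."""
    return (baseline_wer - prompted_wer) / baseline_wer


# Made-up transcripts for illustration only (not data from the paper).
reference = "the cat sat on the mat"
unprompted = [(reference, "the cat sat on a mat")]
prompted = [(reference, "the cat sat on the mat")]
print(wer(unprompted), wer(prompted))
```

Under this definition, a drop from a hypothetical baseline WER of 20.0% to 15.2% is a 24% relative reduction, although the absolute difference is only 4.8 points.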

Authors

Jian Cheng

Topics

Natural Language Processing > Applications > Intent Classification
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

automatic speech recognition, language learning, word error rate, context-aware prompting, elicitor text
