INTERSPEECH 2024

Optimizing Large-Scale Context Retrieval for End-to-End ASR

Abstract

Contextual Automatic Speech Recognition (ASR) requires scalable and accurate retrieval of content relevant to the user’s context. This paper presents a comparative study of two independent context retrieval methods: sequence-level scoring and segment-level scoring. Evaluated on datasets with up to 100k phrases, both methods achieve high retrieval recall; notably, segment-level scoring reaches 75.6% recall over 100k entities. When each method is further integrated with ASR through joint training, both yield significant improvements over non-biased ASR, with word error rate (WER) reductions of up to 36% with 2k entities and 28% with 100k entities. This comparative analysis offers guidance for selecting a context retrieval technique that is both scalable and accurate in contextual ASR applications.
