Optimizing Large-Scale Context Retrieval for End-to-End ASR

Zhiqi Huang; Diamantino Caseiro; Kandarp Joshi; Christopher Li; Pat Rondon; Zelin Wu; Petr Zadrazil; Lillian Zhou

INTERSPEECH 2024

Abstract

Contextual Automatic Speech Recognition (ASR) requires scalable and accurate retrieval of content relevant to the user’s context. This paper presents a comparative study of two independent context retrieval methods: sequence-level and segment-level scoring. Evaluated on datasets with up to 100k phrases, both methods exhibit excellent retrieval recall. Notably, segment-level scoring achieves 75.6% recall over 100k entities. When each method is further integrated with ASR through joint training, significant improvements over non-biased ASR are observed, with WER reductions of up to 36% with 2k entities and 28% with 100k entities. This comparative analysis provides insights for selecting a context retrieval technique that achieves scalable and accurate performance in contextual ASR applications.
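The two headline metrics in the abstract, retrieval recall and WER reduction, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, recall is assumed to be pooled over all utterances, and the WER reduction is assumed to be relative to the non-biased baseline, which the abstract does not state explicitly.

```python
from typing import Sequence, Set


def retrieval_recall(retrieved: Sequence[Set[str]],
                     relevant: Sequence[Set[str]]) -> float:
    """Fraction of relevant phrases found by the retriever.

    Assumption: counts are pooled over all utterances (micro-average);
    the paper may define recall differently.
    """
    hits = sum(len(r & g) for r, g in zip(retrieved, relevant))
    total = sum(len(g) for g in relevant)
    return hits / total if total else 0.0


def relative_wer_reduction(baseline_wer: float, biased_wer: float) -> float:
    """Relative WER reduction of a biased system over a non-biased baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - biased_wer) / baseline_wer


# Made-up toy data, not taken from the paper.
retrieved = [{"call anna", "call hannah"}, {"play abba"}]
relevant = [{"call anna"}, {"play abba", "play acdc"}]
recall = retrieval_recall(retrieved, relevant)   # 2 of 3 relevant phrases found
werr = relative_wer_reduction(10.0, 6.4)         # hypothetical WERs, in percent
```

Under this reading, a baseline WER of 10.0% that drops to 6.4% with biasing corresponds to a 36% relative reduction; the WER values themselves are invented for illustration.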

Topics

Machine Learning > Core Methods > Representation Learning
Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

context retrieval, entity retrieval, end-to-end ASR, segment-level scoring, sequence scoring, contextual ASR
