Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model

Da-Rong Liu; Chunxi Liu; Frank Zhang; Gabriel Synnaeve; Yatharth Saraf; Geoffrey Zweig

INTERSPEECH 2020

Abstract

Videos uploaded to social media are often accompanied by textual descriptions. In building automatic speech recognition (ASR) systems for videos, we can exploit the contextual information provided by such video metadata. In this paper, we explore ASR lattice rescoring by selectively attending to the video descriptions. First, we use an attention-based method to extract contextual vector representations of the video metadata, and feed these representations as part of the input to a neural language model during lattice rescoring. Second, we propose a hybrid pointer network approach that explicitly interpolates the word probabilities of words occurring in the metadata. We perform experimental evaluations on both language modeling and ASR tasks, and demonstrate that both proposed methods improve performance by selectively leveraging the video metadata.
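
The interpolation idea behind the second method can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the mixture below is an assumption in the style of a generic pointer-mixture language model: a scalar gate blends the vocabulary distribution with an attention distribution over the metadata tokens. The function name `hybrid_pointer_probs`, the fixed `gate` value, and the toy scores are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_pointer_probs(vocab_probs, metadata_tokens, attention_scores, gate):
    """Blend a vocabulary distribution with a pointer distribution.

    Assumed form (not taken from the paper):
        p(w) = gate * p_vocab(w) + (1 - gate) * sum of attention on
               metadata positions whose token equals w
    """
    attn = softmax(attention_scores)
    # Scale every vocabulary probability by the gate.
    mixed = {w: gate * p for w, p in vocab_probs.items()}
    # Add the pointer mass; repeated metadata tokens accumulate.
    for tok, a in zip(metadata_tokens, attn):
        mixed[tok] = mixed.get(tok, 0.0) + (1.0 - gate) * a
    return mixed


# Toy example: "zelda" is rare in the vocabulary model but appears
# twice in the (hypothetical) video description.
vocab_probs = {"the": 0.5, "game": 0.3, "zelda": 0.2}
metadata_tokens = ["zelda", "game", "zelda"]
attention_scores = [2.0, 0.5, 1.0]

mixed = hybrid_pointer_probs(vocab_probs, metadata_tokens, attention_scores, gate=0.7)
for word, prob in sorted(mixed.items()):
    print(f"{word}: {prob:.3f}")
```

In this sketch the result is still a valid distribution, and words that occur in the metadata gain probability mass in proportion to the attention they receive. In a trained model the gate and the attention scores would be predicted from the language model state rather than fixed by hand.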

Topics

Machine Learning > Application Areas > Domain Adaptation
Speech & Audio > Recognition > Automatic Speech Recognition
Deep Learning > Learning Types > Deep Learning

Keywords

attention mechanism, automatic speech recognition, language model, neural language model, pointer network, lattice rescoring, video metadata
