LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition

Sreyan Ghosh; Sonal Kumar; Ashish Seth; Purva Chiniya; Utkarsh Tyagi; Ramani Duraiswami; Dinesh Manocha

INTERSPEECH 2024

Abstract

Visual cues such as lip motion have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework that leverages visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of visually-conditioned (generative) ASR error correction. Specifically, we instruct an LLM to predict the transcription from the N-best hypotheses generated by ASR beam search, further conditioned on lip motion. This approach addresses key challenges in traditional audio-visual speech recognition (AVSR) learning, such as the lack of large-scale paired datasets and the difficulty of adapting to new domains. We experiment on four datasets in various settings and show that LipGER improves the Word Error Rate (WER) by 1.1%-49.2%. We also release LipHyp, a dataset of hypothesis-transcription pairs with lip motion cues.
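To make the text side of this setup concrete, here is a minimal sketch of two pieces the abstract mentions: packing N-best hypotheses into an instruction for an LLM, and scoring an output with Word Error Rate. The prompt wording and the function names `build_correction_prompt` and `word_error_rate` are assumptions made for illustration, not the authors' template or code. The lip-motion conditioning is omitted entirely, since the abstract does not describe how it is injected.

```python
from typing import List


def build_correction_prompt(hypotheses: List[str]) -> str:
    """Format N-best ASR hypotheses into an instruction prompt.

    The template below is hypothetical; the paper's actual prompt may differ.
    """
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    numbered = "\n".join(f"{i}. {h}" for i, h in enumerate(hypotheses, start=1))
    return (
        "Below are the N-best hypotheses from a speech recognizer.\n"
        "Write the most likely true transcription.\n\n"
        f"{numbered}\n\nTranscription:"
    )


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    n_best = ["the bat sat on the mat", "the cat sat on the mat"]
    print(build_correction_prompt(n_best))
    print(word_error_rate("the cat sat on the mat", n_best[0]))
```

In a full pipeline the prompt would be passed to an instruction-tuned LLM, and the generated transcription scored against the reference with `word_error_rate`.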

Topics

Deep Learning > Architectures > Transformers; Speech & Audio > Recognition > Speech Recognition

Keywords

automatic speech recognition; visual speech recognition; generative error correction; lip motion; large language model
