INTERSPEECH 2024

Speech Recognition Models are Strong Lip-readers

Abstract

In this work, we show that a large pre-trained ASR model can be adapted to perform lip-reading. Our method enables an ASR model such as Whisper to interpret lip movements in a video and output text transcriptions. We achieve this by learning a cross-modal mapping from a lip sequence to a speech sequence, which lets the pre-trained ASR model perform lip-reading directly. The mapping can be learnt simply by backpropagating the cross-entropy loss on the text labels through the pre-trained, frozen ASR model. In the low-data regime on the LRS3 benchmark, we achieve a gain of 5.7 WER over previous lip-reading methods. Finally, we demonstrate that the same strategy can be extended to other visual speech tasks, such as identifying the spoken language in silent videos.
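
The training signal described above can be illustrated with a deliberately tiny sketch; it is not the authors' implementation. It assumes a fixed matrix `W_FROZEN` standing in for the frozen ASR model, a trainable matrix `M` standing in for the lip-to-speech mapping, and lip "sequences" collapsed to single feature vectors. All names, shapes, and values are hypothetical. The point it shows is that the cross-entropy gradient passes through the frozen weights but only the mapping is updated.

```python
import math
import random

# Hypothetical stand-in for a frozen ASR model: maps a 3-d "speech" feature
# vector to logits over 3 text tokens. It is never updated during training.
W_FROZEN = [[1.0, -0.5, 0.0],
            [0.0, 1.0, -0.5],
            [-0.5, 0.0, 1.0]]

# Hypothetical toy data: (lip feature vector, text token label).
DATA = [([1.0, 0.0, 0.0], 2),
        ([0.0, 1.0, 0.0], 0),
        ([0.0, 0.0, 1.0], 1)]


def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]


def train_mapper(data, w_frozen, d_lip, steps=200, lr=0.1, seed=0):
    """Learn a lip-to-speech mapping M by SGD on the cross-entropy loss.

    The gradient flows through w_frozen, but w_frozen itself is not changed.
    Returns the learnt mapping and the mean loss per epoch.
    """
    rng = random.Random(seed)
    n_tokens, d_speech = len(w_frozen), len(w_frozen[0])
    M = [[rng.uniform(-0.1, 0.1) for _ in range(d_lip)]
         for _ in range(d_speech)]
    losses = []
    for _ in range(steps):
        total = 0.0
        for x, y in data:
            s = matvec(M, x)                    # lip -> pseudo-speech features
            p = softmax(matvec(w_frozen, s))    # frozen model's token distribution
            total += -math.log(p[y])            # cross-entropy on the text label
            # Gradient of the loss with respect to the logits.
            d_logits = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
            # Backpropagate through the frozen model: d_s = W^T d_logits.
            d_s = [sum(w_frozen[k][j] * d_logits[k] for k in range(n_tokens))
                   for j in range(d_speech)]
            # Update only the mapping; the frozen weights are left untouched.
            for j in range(d_speech):
                for i in range(d_lip):
                    M[j][i] -= lr * d_s[j] * x[i]
        losses.append(total / len(data))
    return M, losses


if __name__ == "__main__":
    mapper, history = train_mapper(DATA, W_FROZEN, d_lip=3)
    print("first-epoch loss:", history[0], "last-epoch loss:", history[-1])
```

In a realistic setting the mapping would be a sequence model producing features in the ASR encoder's input space, and an automatic-differentiation framework would replace the hand-written gradient; the sketch only isolates which parameters receive updates.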
