AudioVSR: Enhancing Video Speech Recognition with Audio Data

Xiaoda Yang; Xize Cheng; Jiaqi Duan; Hongshun Qiu; Minjie Hong; Minghui Fang; Shengpeng Ji; Jialong Zuo; Zhiqing Hong; Zhimeng Zhang; Tao Jin

EMNLP 2024

Abstract

Visual Speech Recognition (VSR) aims to predict spoken content by analyzing lip movements in videos. Recently reported state-of-the-art results in VSR often rely on increasingly large amounts of video data, yet publicly available transcribed video datasets remain scarce compared with audio data. To further enhance VSR models using audio data, we employed a generative model for data inflation, integrating the synthetic data with authentic visual data. In essence, the generative model contributes an additional source of insight that strengthens the recognition model. In the cross-language setting, previous work has performed poorly on non-Indo-European languages. To address this, we trained a multi-language-family modal fusion model, AudioVSR. Leveraging the concept of modal transfer, we achieved significant results in downstream VSR tasks under conditions of data scarcity. To the best of our knowledge, AudioVSR is the first work on cross-language-family audio-lip alignment, and it achieves a new state of the art (SOTA) in the cross-language scenario.

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Artificial Intelligence > Learning Paradigms > Transfer Learning

Deep Learning > Models > Generative Models

Speech & Audio > Recognition > Speech Recognition

Keywords

visual speech recognition; generative model; multimodal fusion; lip movement; audio-visual alignment; modal fusion
