VHASR: A Multimodal Speech Recognition System With Vision Hotwords

Jiliang Hu; Zuchao Li; Ping Wang; Haojun Ai; Lefei Zhang; Hai Zhao

EMNLP 2024


Abstract

Image-based multimodal automatic speech recognition (ASR) models enhance speech recognition performance by incorporating audio-related images. However, some works suggest that introducing image information into the model does not help improve ASR performance. In this paper, we propose a novel approach that effectively utilizes audio-related image information, and we build VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model’s speech recognition capability. Our system uses a dual-stream architecture, which first transcribes the text on the two streams separately and then combines the outputs. We evaluate the proposed model on four datasets: Flickr8k, ADE20k, COCO, and OpenImages. The experimental results show that VHASR can effectively utilize key information in images to enhance the model’s speech recognition ability. Its performance not only surpasses that of unimodal ASR but also achieves state-of-the-art (SOTA) results among existing image-based multimodal ASR models.
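The abstract says the two streams transcribe separately and their outputs are then combined, but it does not say how. As a rough illustration only, the sketch below assumes each stream emits a per-step probability distribution over tokens and merges them by weighted averaging. The function name `merge_streams`, the `vision_weight` parameter, and the averaging rule are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# One decoding step: token -> probability (hypothetical representation).
Distribution = Dict[str, float]


def merge_streams(
    asr_probs: List[Distribution],
    vision_probs: List[Distribution],
    vision_weight: float = 0.3,
) -> List[str]:
    """Merge two aligned streams by weighted averaging (an assumed rule).

    asr_probs:     per-step distributions from the audio-only stream.
    vision_probs:  per-step distributions from the vision-hotword stream.
    vision_weight: share of the final score given to the vision stream.
    """
    if len(asr_probs) != len(vision_probs):
        raise ValueError("both streams must cover the same number of steps")
    if not 0.0 <= vision_weight <= 1.0:
        raise ValueError("vision_weight must lie in [0, 1]")

    merged: List[str] = []
    for asr_step, vision_step in zip(asr_probs, vision_probs):
        vocab = set(asr_step) | set(vision_step)
        scores = {
            tok: (1.0 - vision_weight) * asr_step.get(tok, 0.0)
            + vision_weight * vision_step.get(tok, 0.0)
            for tok in vocab
        }
        # Sort first so that ties are broken deterministically.
        merged.append(max(sorted(scores), key=lambda tok: scores[tok]))
    return merged


if __name__ == "__main__":
    # Toy example: the audio stream slightly prefers "dock", while the
    # vision stream (an image of a dog, say) strongly prefers "dog".
    asr = [{"a": 0.9, "the": 0.1}, {"dock": 0.55, "dog": 0.45}]
    vision = [{"a": 0.5, "the": 0.5}, {"dog": 0.9, "dock": 0.1}]
    print(merge_streams(asr, vision, vision_weight=0.3))
    print(merge_streams(asr, vision, vision_weight=0.0))
```

With `vision_weight=0.0` the result falls back to the audio-only transcription, which makes it easy to compare the merged output against a unimodal baseline in this toy setting.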



Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Architectures > Neural Networks
Speech & Audio > Recognition > Automatic Speech Recognition
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Learning Types > Multi-Modal Learning
Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

multimodal learning; automatic speech recognition; multimodal speech recognition; vision hotword; dual-stream architecture; image-based ASR; image-based multimodal

