Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Minghan Wang; Yuxia Wang; Thuy-Trang Vu; Ehsan Shareghi; Reza Haf

EMNLP 2024

Abstract

Multimodal large language models (MLLMs) have made significant progress in integrating information across modalities, yet real-world applications in educational and scientific domains remain challenging. This paper introduces the Multimodal Scientific ASR (MS-ASR) task, which focuses on transcribing scientific conference videos by leveraging visual information from slides to improve the accuracy of technical terminology. Because traditional metrics such as word error rate (WER) fall short in assessing performance accurately, we propose severity-aware WER (SWER), which considers the content type and severity of ASR errors. We also propose the Scientific Vision Augmented ASR (SciVASR) framework as a baseline method, enabling MLLMs to improve transcript quality through post-editing. Evaluations of state-of-the-art MLLMs, including GPT-4o, show a 45% improvement over speech-only baselines, highlighting the importance of multimodal information integration.
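To make the metric discussion concrete, the following is a minimal sketch of plain WER, the traditional metric the abstract refers to, computed as word-level edit distance divided by reference length. It is not the paper's SWER: the abstract does not specify how content type and severity are weighted, so no weighting is attempted here. The function name `word_error_rate` and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Plain WER: (substitutions + deletions + insertions) / reference length.

    Illustrative sketch only; tokenization is naive whitespace splitting.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences (not taken from the paper's data).
reference = "we fine tune the transformer encoder"
minor_error = "we fine tune a transformer encoder"      # function word wrong
severe_error = "we fine tune the transform encoder"     # technical term wrong

print(word_error_rate(reference, minor_error))
print(word_error_rate(reference, severe_error))
```

Both hypotheses contain exactly one substituted word, so plain WER scores them identically even though only the second corrupts a technical term. This is the kind of gap that a severity-aware metric is meant to address.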

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Resources & Methods > Large Language Models
Speech & Audio > Recognition > Automatic Speech Recognition
Natural Language Processing > Applications > Multimodal NLP

Keywords

speech recognition, multimodal learning, video understanding, automatic speech recognition, multimodal large language model, scientific text, video transcription
