Beyond Single Frames: Can LMMs Comprehend Implicit Narratives in Comic Strip?

Xiaochen Wang; Heming Xia; Jialin Song; Longyu Guan; Qingxiu Dong; Rui Li; Yixin Yang; Yifan Pu; Weiyao Luo; Yiru Wang; Xiangdi Meng; Wenjie Li; Zhifang Sui

EMNLP 2025

Abstract

Large Multimodal Models (LMMs) have demonstrated strong performance on vision-language benchmarks, yet current evaluations predominantly focus on single-image reasoning. In contrast, real-world scenarios often involve understanding sequences of images. A typical scenario is comic strip understanding, which requires models to perform nuanced visual reasoning beyond surface-level recognition. To address this gap, we introduce STRIPCIPHER, a benchmark designed to evaluate a model's ability to understand implicit narratives in silent comics. STRIPCIPHER is a high-quality, human-annotated dataset featuring fine-grained annotations and comprehensive coverage of varying difficulty levels. It comprises three tasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Notably, evaluation results on STRIPCIPHER reveal a significant gap between current LMMs and human performance: for example, GPT-4o achieves only 23.93% accuracy in the reordering task, 56.07% below human performance. These findings underscore the limitations of current LMMs in implicit visual narrative understanding and highlight opportunities for advancing sequential multimodal reasoning.

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Learning Types > Zero-Shot Learning; Computer Vision > Processing > Video Understanding; Deep Learning > Learning Types > Multi-Modal Learning

Keywords

visual reasoning; large multimodal model; multimodal reasoning; sequential image reasoning; sequential reasoning; visual narrative comprehension; temporal narrative reordering; comic strip understanding; implicit visual reasoning; comic strip; implicit narrative
