Multimodal Learning

1257 directly classified papers

Papers

SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models EMNLP 2025

Composed Image Retrieval for Training-Free Domain Conversion WACV 2025

DELOC: Document Element Localizer EMNLP 2025

ComicScene154: A Scene Dataset for Comic Analysis EMNLP 2025

VISaGE: Understanding Visual Generics and Exceptions EMNLP 2025

Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning EMNLP 2025

Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models EMNLP 2025

Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset EMNLP 2025

iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models EMNLP 2025

From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens EMNLP 2025

PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications EMNLP 2025

Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance EMNLP 2025

Generating Spatial Knowledge Graphs from Automotive Diagrams for Question Answering EMNLP 2025

PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models EMNLP 2025

From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding EMNLP 2025

CAPSTONE: Composable Attribute‐Prompted Scene Translation for Zero‐Shot Vision–Language Reasoning EMNLP 2025

See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models ACL 2025

MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering ACL 2025

ExMute: A Context-Enriched Multimodal Dataset for Hateful Memes COLING 2025

Sign2Vis: Automated Data Visualization from Sign Language ACL 2025

Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders NAACL 2025

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers ACL 2025

Unbiased Missing-modality Multimodal Learning ICCV 2025

MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens ACL 2025

CUET_Novice@DravidianLangTech 2025: A Multimodal Transformer-Based Approach for Detecting Misogynistic Memes in Malayalam Language NAACL 2025