Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models
EMNLP 2025
Composed Image Retrieval for Training-Free Domain Conversion
WACV 2025
DELOC: Document Element Localizer
EMNLP 2025
ComicScene154: A Scene Dataset for Comic Analysis
EMNLP 2025
VISaGE: Understanding Visual Generics and Exceptions
EMNLP 2025
Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
EMNLP 2025
Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models
EMNLP 2025
Concept-pedia: a Wide-coverage Semantically-annotated Multimodal Dataset
EMNLP 2025
iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models
EMNLP 2025
From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens
EMNLP 2025
PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications
EMNLP 2025
Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance
EMNLP 2025
Generating Spatial Knowledge Graphs from Automotive Diagrams for Question Answering
EMNLP 2025
PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models
EMNLP 2025
From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
EMNLP 2025
CAPSTONE: Composable Attribute‐Prompted Scene Translation for Zero‐Shot Vision–Language Reasoning
EMNLP 2025
See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models
ACL 2025
MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering
ACL 2025
ExMute: A Context-Enriched Multimodal Dataset for Hateful Memes
COLING 2025
Sign2Vis: Automated Data Visualization from Sign Language
ACL 2025
Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders
NAACL 2025
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
ACL 2025
Unbiased Missing-modality Multimodal Learning
ICCV 2025
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens
ACL 2025
CUET_Novice@DravidianLangTech 2025: A Multimodal Transformer-Based Approach for Detecting Misogynistic Memes in Malayalam Language
NAACL 2025