Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
Just KIDDIN’: Knowledge Infusion and Distillation for Detection of INdecent Memes
ACL 2025
Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains
ACL 2025
Express What You See: Can Multimodal LLMs Decode Visual Ciphers with Intuitive Semiosis Comprehension?
ACL 2025
Chat-Driven Text Generation and Interaction for Person Retrieval
EMNLP 2025
MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval
ACL 2025
MDIT-Bench: Evaluating the Dual-Implicit Toxicity in Large Multimodal Models
ACL 2025
MemeQA: Holistic Evaluation for Meme Understanding
ACL 2025
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
ACL 2025
WAFFLE: Fine-tuning Multi-Modal Model for Automated Front-End Development
ACL 2025
V-Oracle: Making Progressive Reasoning in Deciphering Oracle Bones for You and Me
ACL 2025
Unbiased Missing-modality Multimodal Learning
ICCV 2025
InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning
ACL 2025
Evaluating Multimodal Language Models as Visual Assistants for Visually Impaired Users
ACL 2025
MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching
ACL 2025
VideoRAG: Retrieval-Augmented Generation over Video Corpus
ACL 2025
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
ACL 2025
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
ACL 2025
Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times
ACL 2025
R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding
ACL 2025
MMInA: Benchmarking Multihop Multimodal Internet Agents
ACL 2025
MNLP@DravidianLangTech 2025: A Deep Multimodal Neural Network for Hate Speech Detection in Dravidian Languages
NAACL 2025
FOCUS: Evaluating Pre-trained Vision-Language Models on Underspecification Reasoning
ACL 2025
Sign2Vis: Automated Data Visualization from Sign Language
ACL 2025
MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering
ACL 2025
Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates
ACL 2025