Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
FREE: Fast and Robust Vision Language Models with Early Exits
ACL 2025
Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation
ACL 2025
VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
EMNLP 2025
Ambiguity-aware Multi-level Incongruity Fusion Network for Multi-Modal Sarcasm Detection
COLING 2025
HerWILL@DravidianLangTech 2025: Ensemble Approach for Misogyny Detection in Memes Using Pre-trained Text and Vision Transformers
NAACL 2025
Query-LIFE: Query-aware Language Image Fusion Embedding for E-Commerce Relevance
COLING 2025
UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation
ACL 2025
Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking
COLING 2025
DLRG@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
NAACL 2025
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
ACL 2025
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
ACL 2025
MemeQA: Holistic Evaluation for Meme Understanding
ACL 2025
MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval
ACL 2025
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
ACL 2025
V-Oracle: Making Progressive Reasoning in Deciphering Oracle Bones for You and Me
ACL 2025
Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times
ACL 2025
VisuoThink: Empowering LVLM Reasoning with Multimodal Tree Search
ACL 2025
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
ACL 2025
Enhancing Multimodal Retrieval via Complementary Information Extraction and Alignment
ACL 2025
InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning
ACL 2025
OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval
ACL 2025
WAFFLE: Fine-tuning Multi-Modal Model for Automated Front-End Development
ACL 2025
Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures
ACL 2025
Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities
ACL 2025
VADE: Visual Attention Guided Hallucination Detection and Elimination
ACL 2025