Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
MNLP@DravidianLangTech 2025: Transformer-based Multimodal Framework for Misogyny Meme Detection
NAACL 2025
LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval
EMNLP 2025
iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models
EMNLP 2025
From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens
EMNLP 2025
PunMemeCN: A Benchmark to Explore Vision-Language Models’ Understanding of Chinese Pun Memes
EMNLP 2025
PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models
EMNLP 2025
Generating Spatial Knowledge Graphs from Automotive Diagrams for Question Answering
EMNLP 2025
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
CVPR 2025
Ambiguity-aware Multi-level Incongruity Fusion Network for Multi-Modal Sarcasm Detection
COLING 2025
A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling
IJCNLP 2025
HerWILL@DravidianLangTech 2025: Ensemble Approach for Misogyny Detection in Memes Using Pre-trained Text and Vision Transformers
NAACL 2025
TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation
CVPR 2025
Query-LIFE: Query-aware Language Image Fusion Embedding for E-Commerce Relevance
COLING 2025
Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding
CVPR 2025
LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes
CVPR 2025
Semantic and Sequential Alignment for Referring Video Object Segmentation
CVPR 2025
Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking
COLING 2025
Chat-Driven Text Generation and Interaction for Person Retrieval
EMNLP 2025
TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection
EMNLP 2025
Diagram-Driven Course Questions Generation
EMNLP 2025
MAKAR: a Multi-Agent framework based Knowledge-Augmented Reasoning for Grounded Multimodal Named Entity Recognition
EMNLP 2025
Audio-centric Video Understanding Benchmark without Text Shortcut
EMNLP 2025
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
EMNLP 2025
Multimodal Language Models See Better When They Look Shallower
EMNLP 2025
Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation
ACL 2025