Multimodal Learning

1257 directly classified papers

Papers

MNLP@DravidianLangTech 2025: Transformer-based Multimodal Framework for Misogyny Meme Detection NAACL 2025

LLM-Driven Completeness and Consistency Evaluation for Cultural Heritage Data Augmentation in Cross-Modal Retrieval EMNLP 2025

iKnow-audio: Integrating Knowledge Graphs with Audio-Language Models EMNLP 2025

From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens EMNLP 2025

PunMemeCN: A Benchmark to Explore Vision-Language Models’ Understanding of Chinese Pun Memes EMNLP 2025

PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models EMNLP 2025

Generating Spatial Knowledge Graphs from Automotive Diagrams for Question Answering EMNLP 2025

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation CVPR 2025

Ambiguity-aware Multi-level Incongruity Fusion Network for Multi-Modal Sarcasm Detection COLING 2025

A Multimodal Recaptioning Framework to Account for Perceptual Diversity Across Languages in Vision-Language Modeling IJCNLP 2025

HerWILL@DravidianLangTech 2025: Ensemble Approach for Misogyny Detection in Memes Using Pre-trained Text and Vision Transformers NAACL 2025

TSAM: Temporal SAM Augmented with Multimodal Prompts for Referring Audio-Visual Segmentation CVPR 2025

Query-LIFE: Query-aware Language Image Fusion Embedding for E-Commerce Relevance COLING 2025

Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding CVPR 2025

LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes CVPR 2025

Semantic and Sequential Alignment for Referring Video Object Segmentation CVPR 2025

Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking COLING 2025

Chat-Driven Text Generation and Interaction for Person Retrieval EMNLP 2025

TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection EMNLP 2025

Diagram-Driven Course Questions Generation EMNLP 2025

MAKAR: a Multi-Agent framework based Knowledge-Augmented Reasoning for Grounded Multimodal Named Entity Recognition EMNLP 2025

Audio-centric Video Understanding Benchmark without Text Shortcut EMNLP 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration EMNLP 2025

Multimodal Language Models See Better When They Look Shallower EMNLP 2025

Enhance Multimodal Consistency and Coherence for Text-Image Plan Generation ACL 2025