Multimodal NLP

86 directly classified papers

Papers

SketchAgent: Language-Driven Sequential Sketch Generation CVPR 2025

2M-BELEBELE: Highly Multilingual Speech and American Sign Language Comprehension Dataset ACL 2025

M2-TabFact: Multi-Document Multi-Modal Fact Verification with Visual and Textual Representations of Tabular Data ACL 2025

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary CVPR 2025

GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs CVPR 2025

MIRe: Enhancing Multimodal Queries Representation via Fusion-Free Modality Interaction for Multimodal Retrieval ACL 2025

REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark ACL 2025

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering CVPR 2025

Multimodal Coreference Resolution for Chinese Social Media Dialogues: Dataset and Benchmark Approach ACL 2025

SilVar: Speech-Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization EMNLP 2025

Cross-Aligned Fusion for Multimodal Understanding WACV 2025

Deciphering the Complaint Aspects: Towards an Aspect-Based Complaint Identification Model with Video Complaint Dataset in Finance WACV 2025

UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation SEMEVAL 2025

Vision-Language Models Can't See the Obvious ICCV 2025

AIMA at SemEval-2025 Task 1: Bridging Text and Image for Idiomatic Knowledge Extraction via Mixture of Experts SEMEVAL 2025

Multi-Schema Proximity Network for Composed Image Retrieval ICCV 2025

Advancing Sentiment Analysis in Tamil-English Code-Mixed Texts: Challenges and Transformer-Based Solutions NAACL 2025

VLG-BERT: Towards Better Interpretability in LLMs through Visual and Linguistic Grounding NAACL 2025

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis CVPR 2025

PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction ICCV 2025

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation ICCV 2025

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP ICCV 2025

From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning ICCV 2025

Unified Multimodal Understanding via Byte-Pair Visual Encoding ICCV 2025

Towards Natural Language-Based Document Image Retrieval: New Dataset and Benchmark CVPR 2025