Multimodal NLP

86 directly classified papers

Papers

SketchAgent: Language-Driven Sequential Sketch Generation CVPR 2025

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary CVPR 2025

Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP ICCV 2025

From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning ICCV 2025

Unified Multimodal Understanding via Byte-Pair Visual Encoding ICCV 2025

Improving Vision-and-Language Reasoning via Spatial Relations Modeling WACV 2024

CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free WACV 2024

Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models CVPR 2024

DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding ACL 2024

MM-SOC: Benchmarking Multimodal Large Language Models in Social Media Platforms ACL 2024

Multilingual Synopses of Movie Narratives: A Dataset for Vision-Language Story Understanding EMNLP 2024

Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment CVPR 2024

Peacock: A Family of Arabic Multimodal Large Language Models and Benchmarks ACL 2024

Nebula: A discourse aware Minecraft Builder EMNLP 2024

Spontaneous gestures encoded by hand positions improve language models: An Information-Theoretic motivated study ACL 2023

CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training ACL 2023

Dynamic Inference With Grounding Based Vision and Language Models CVPR 2023

Position-Guided Text Prompt for Vision-Language Pre-Training CVPR 2023

Coupling Artificial Neurons in BERT and Biological Neurons in the Human Brain AAAI 2023

Clover: Towards a Unified Video-Language Alignment and Fusion Model CVPR 2023

Grounded and well-rounded: a methodological approach to the study of cross-modal and cross-lingual grounding EMNLP 2023

Revisiting the "Video" in Video-Language Understanding CVPR 2022

Bridging between Cognitive Processing Signals and Linguistic Features via a Unified Attentional Network AAAI 2022

Visual Grounding of Inter-lingual Word-Embeddings EMNLP 2022

CogBERT: Cognition-Guided Pre-trained Language Models COLING 2022