Multimodal Learning

323 directly classified papers

Papers

External Memory Matters: Generalizable Object-Action Memory for Retrieval-Augmented Long-Term Video Understanding IJCAI 2025

PresentAgent: Multimodal Agent for Presentation Video Generation EMNLP 2025

Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision? ACL 2025

Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation ICCV 2025

Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval ICCV 2025

Unified Multimodal Understanding via Byte-Pair Visual Encoding ICCV 2025

From Introspection to Best Practices: Principled Analysis of Demonstrations in Multimodal In-Context Learning NAACL 2025

Let Modalities Teach Each Other: Modal-Collaborative Knowledge Extraction and Fusion for Multimodal Knowledge Graph Completion NAACL 2025

UniEDU: Toward Unified and Efficient Large Multimodal Models for Educational Tasks EMNLP 2025

TEAM_STRIKERS@DravidianLangTech2025: Misogyny Meme Detection in Tamil Using Multimodal Deep Learning NAACL 2025

LegoSLM: Connecting LLM with Speech Encoder using CTC Posteriors EMNLP 2025

The_Deathly_Hallows@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages NAACL 2025

ClimbingCap: Multi-Modal Dataset and Method for Rock Climbing in World Coordinate CVPR 2025

PS-Diffusion: Photorealistic Subject-Driven Image Editing with Disentangled Control and Attention CVPR 2025

DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding CVPR 2025

YNU-HPCC at SemEval-2025 Task 1: Enhancing Multimodal Idiomaticity Representation via LoRA and Hybrid Loss Optimization SEMEVAL 2025

Zero-Shot Image Captioning with Multi-type Entity Representations AAAI 2025

ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval AAAI 2025

Cross-Aligned Fusion for Multimodal Understanding WACV 2025

BIG-FUSION: Brain-Inspired Global-Local Context Fusion Framework for Multimodal Emotion Recognition in Conversations AAAI 2025

GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model CVPR 2025

Explanation Bottleneck Models AAAI 2025

Can VLMs Actually See and Read? A Survey on Modality Collapse in Vision-Language Models ACL 2025

Multi-Modal Synergistic Implicit Image Enhancement for Efficient Optical Flow Estimation CVPR 2025

AXIS: Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents ACL 2025