Multi-Modal Learning

3194 directly classified papers

Papers

Feature Design for Bridging SAM and CLIP toward Referring Image Segmentation WACV 2025

Temporally Streaming Audio-Visual Synchronization for Real-World Videos WACV 2025

POS-Aware Neural Approaches for Word Alignment in Dravidian Languages COLING 2025

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks RSS 2025

If I feel smart, I will do the right thing: Combining Complementary Multimodal Information in Visual Language Models COLING 2025

Electromyography-Informed Facial Expression Reconstruction for Physiological-Based Synthesis and Analysis CVPR 2025

Click&Describe: Multimodal Grounding and Tracking for Aerial Objects WACV 2025

Multi-Resolution Guided 3D GANs for Medical Image Translation WACV 2025

VideoGameBunny: Towards Vision Assistants for Video Games WACV 2025

CIOL at SemEval-2025 Task 11: Multilingual Pre-trained Model Fusion for Text-based Emotion Recognition SEMEVAL 2025

GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-Grained Video-Language Learning WACV 2025

Zhoumou at SemEval-2025 Task 1: Leveraging Multimodal Data Augmentation and Large Language Models for Enhanced Idiom Understanding SEMEVAL 2025

HeightMapNet: Explicit Height Modeling for End-to-End HD Map Learning WACV 2025

Event-Guided Fusion-Mamba for Context-Aware 3D Human Pose Estimation WACV 2025

SyncViolinist: Music-Oriented Violin Motion Generation Based on Bowing and Fingering WACV 2025

PrevPredMap: Exploring Temporal Modeling with Previous Predictions for Online Vectorized HD Map Construction WACV 2025

3D Part Segmentation via Geometric Aggregation of 2D Visual Features WACV 2025

CTYUN-AI at SemEval-2025 Task 1: Learning to Rank for Idiomatic Expressions SEMEVAL 2025

Deduce and Select Evidences with Language Models for Training-Free Video Goal Inference WACV 2025

AIDE: Improving 3D Open-Vocabulary Semantic Segmentation by Aligned Vision-Language Learning WACV 2025

Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions CVPR 2025

Latency Robust Cooperative Perception using Asynchronous Feature Fusion WACV 2025

VMAs: Video-to-Music Generation via Semantic Alignment in Web Music Videos WACV 2025

Combining Inherent Knowledge of Vision-Language Models with Unsupervised Domain Adaptation through Strong-Weak Guidance WACV 2025

VCRMNER: Visual Cue Refinement in Multimodal NER using CLIP Prompts COLING 2025