Multi-Modal Learning

3194 directly classified papers

Papers

D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching AAAI 2025

VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization AAAI 2025

POS-Aware Neural Approaches for Word Alignment in Dravidian Languages COLING 2025

Language Driven Occupancy Prediction ICCV 2025

Chimera: Improving Generalist Model with Domain-Specific Experts ICCV 2025

EgoM2P: Egocentric Multimodal Multitask Pretraining ICCV 2025

Target Bias Is All You Need: Zero-Shot Debiasing of Vision-Language Models with Bias Corpus ICCV 2025

Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness ICCV 2025

XTrack: Multimodal Training Boosts RGB-X Video Object Trackers ICCV 2025

ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning ICCV 2025

BlinkTrack: Feature Tracking over 80 FPS via Events and Images ICCV 2025

MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation ICCV 2025

StyleMotif: Multi-Modal Motion Stylization using Style-Content Cross Fusion ICCV 2025

ADIEE: Automatic Dataset Creation and Scorer for Instruction-Guided Image Editing Evaluation ICCV 2025

Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences ICCV 2025

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning ICCV 2025

PanSt3R: Multi-view Consistent Panoptic Segmentation ICCV 2025

G2SF: Geometry-Guided Score Fusion for Multimodal Industrial Anomaly Detection ICCV 2025

SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World ICCV 2025

CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation ICCV 2025

Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology ICCV 2025

DADM: Dual Alignment of Domain and Modality for Face Anti-spoofing ICCV 2025

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance ICCV 2025

ReMP-AD: Retrieval-enhanced Multi-modal Prompt Fusion for Few-Shot Industrial Visual Anomaly Detection ICCV 2025

Audio-centric Video Understanding Benchmark without Text Shortcut EMNLP 2025