Multi-Modal Learning

1457 directly classified papers

Papers

DialogDraw: Image Generation and Editing System Based on Multi-Turn Dialogue AAAI 2025

GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art ACL 2025

Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension AAAI 2025

Graphic Design with Large Multimodal Model AAAI 2025

External Reliable Information-enhanced Multimodal Contrastive Learning for Fake News Detection AAAI 2025

Cross-Domain Trajectory Association Based on Hierarchical Spatiotemporal Enhanced Attention Hypergraph AAAI 2025

Debiased Multimodal Understanding for Human Language Sequences AAAI 2025

Decomposing and Fusing Intra- and Inter-Sensor Spatio-Temporal Signal for Multi-Sensor Wearable Human Activity Recognition AAAI 2025

MindTuner: Cross-Subject Visual Decoding with Visual Fingerprint and Semantic Correction AAAI 2025

Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration AAAI 2025

Towards Audio-Visual Navigation in Noisy Environments: A Large-Scale Benchmark Dataset and an Architecture Considering Multiple Sound-Sources AAAI 2025

Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization AAAI 2025

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning AAAI 2025

Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update AAAI 2025

Retention Score: Quantifying Jailbreak Risks for Vision Language Models AAAI 2025

PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection AAAI 2025

Differentiated Vision: Unveiling Entity-Specific Visual Modality Requirements for Multimodal Knowledge Graph EMNLP 2025

Read, Watch and Scream! Sound Generation from Text and Video AAAI 2025

Semi-Supervised Multi-View Multi-Label Learning with View-Specific Transformer and Enhanced Pseudo-Label AAAI 2025

Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues ACL 2025

Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists? ACL 2025

Visual Evidence Prompting Mitigates Hallucinations in Large Vision-Language Models ACL 2025

AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness ACL 2025

Sharper and Faster mean Better: Towards More Efficient Vision-Language Model for Hour-scale Long Video Understanding ACL 2025

Cultivating Gaming Sense for Yourself: Making VLMs Gaming Experts ACL 2025