Multimodal Learning

85 directly classified papers

Papers

Multimodal Prior Learning with Double Constraint Alignment for Snapshot Spectral Compressive Imaging IJCAI 2025

Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity CVPR 2025

Multimodal LLMs as Customized Reward Models for Text-to-Image Generation ICCV 2025

Streaming VideoLLMs for Real-Time Procedural Video Understanding ICCV 2025

Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma? ICCV 2025

HumorDB: Can AI understand graphical humor? ICCV 2025

Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios EMNLP 2025

AI Knows Where You Are: Exposure, Bias, and Inference in Multimodal Geolocation with KoreaGEO EMNLP 2025

M-LongDoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework EMNLP 2025

5cNLP at BioLaySumm2025: Prompts, Retrieval, and Multimodal Fusion ACL 2025

CLEAR: Character Unlearning in Textual and Visual Modalities ACL 2025

Multimodal Invariant Sentiment Representation Learning ACL 2025

AccentFold: A Journey through African Accents for Zero-Shot ASR Adaptation to Target Accents EACL 2024

MemeMQA: Multimodal Question Answering for Memes via Rationale-Based Inferencing ACL 2024

YYama@Multimodal Hate Speech Event Detection 2024: Simpler Prompts, Better Results - Enhancing Zero-shot Detection with a Large Multimodal Model EACL 2024

CLTL@Multimodal Hate Speech Event Detection 2024: The Winning Approach to Detecting Multimodal Hate Speech and Its Targets EACL 2024

LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding COLING 2024

Multimodal Prompt Learning with Missing Modalities for Sentiment Analysis and Emotion Recognition ACL 2024

MemeGuard: An LLM and VLM-based Framework for Advancing Content Moderation via Meme Intervention ACL 2024

DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction CVPR 2024

Data-Efficient Multimodal Fusion on a Single GPU CVPR 2024

Polos: Multimodal Metric Learning from Human Feedback for Image Captioning CVPR 2024

EEVR: A Dataset of Paired Physiological Signals and Textual Descriptions for Joint Emotion Representation Learning NIPS 2024

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding NIPS 2024

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction NIPS 2024