Multimodal Learning

323 directly classified papers

Papers

Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding CVPR 2024

Language-Driven Anchors for Zero-Shot Adversarial Robustness CVPR 2024

Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters CVPR 2024

cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers NIPS 2024

LoTLIP: Improving Language-Image Pre-training for Long Text Understanding NIPS 2024

No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance NIPS 2024

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models NIPS 2024

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models NIPS 2024

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens NIPS 2024

Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner NIPS 2024

Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network CVPR 2023

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-Training Model CVPR 2023

Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework NIPS 2023

ImageNetVC: Zero- and Few-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories EMNLP 2023

Structure and Content-Guided Video Synthesis with Diffusion Models ICCV 2023

FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction ACL 2023

Measuring Progress in Fine-grained Vision-and-Language Understanding ACL 2023

Tackling Modality Heterogeneity with Multi-View Calibration Network for Multimodal Sentiment Detection ACL 2023

ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations CVPR 2023

AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR CVPR 2023

FOCAL: Contrastive Learning for Multimodal Time-Series Sensing Signals in Factorized Orthogonal Latent Space NIPS 2023

Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation ACL 2023

MultiEMO: An Attention-Based Correlation-Aware Multimodal Fusion Framework for Emotion Recognition in Conversations ACL 2023

Visually-Augmented Pretrained Language Models for NLP Tasks without Images ACL 2023

Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning ACL 2023