Multimodal Learning

323 directly classified papers

Papers

Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding CVPR 2024

MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models CVPR 2024

Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers ACL 2024

MCIL: Multimodal Counterfactual Instance Learning for Low-resource Entity-based Multimodal Information Extraction COLING 2024

Large Language Models can Share Images, Too! ACL 2024

Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models ACL 2024

CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning CVPR 2024

Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs EMNLP 2024

Identifiable Object-Centric Representation Learning via Probabilistic Slot Attention NeurIPS 2024

Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks EMNLP 2024

Large Language Models Are Challenged by Habitat-Centered Reasoning EMNLP 2024

The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention EMNLP 2024

GraphVis: Boosting LLMs with Visual Knowledge Graph Integration NeurIPS 2024

GOME: Grounding-based Metaphor Binding With Conceptual Elaboration For Figurative Language Illustration EMNLP 2024

Extending AZee with Non-manual Gesture Rules for French Sign Language COLING 2024

InteRead: An Eye Tracking Dataset of Interrupted Reading COLING 2024

Unveiling the mystery of visual attributes of concrete and abstract concepts: Variability, nearest neighbors, and challenging categories EMNLP 2024

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation CVPR 2024

PALM: Few-Shot Prompt Learning for Audio Language Models EMNLP 2024

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI CVPR 2024

Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization CVPR 2024

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation CVPR 2024

Revisiting Multimodal Transformers for Tabular Data with Text Fields ACL 2024

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs CVPR 2024

RWKV-CLIP: A Robust Vision-Language Representation Learner EMNLP 2024