Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis
CVPR 2020
ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
CVPR 2020
Active Speakers in Context
CVPR 2020
Learning to Represent Image and Text with Denotation Graph
EMNLP 2020
Reading Between the Lines: Exploring Infilling in Visual Narratives
EMNLP 2020
Visually Grounded Continual Learning of Compositional Phrases
EMNLP 2020
Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents
EMNLP 2020
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training
EMNLP 2020
STL-CQA: Structure-based Transformers with Localization and Encoding for Chart Question Answering
EMNLP 2020
Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings
EMNLP 2020
VD-BERT: A Unified Vision and Dialog Transformer with BERT
EMNLP 2020
Sub-Instruction Aware Vision-and-Language Navigation
EMNLP 2020
Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts
EMNLP 2020
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
EMNLP 2020
ConceptBert: Concept-Aware Representation for Visual Question Answering
EMNLP 2020
MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention
EMNLP 2020
They Are Not All Alike: Answering Different Spatial Questions Requires Different Grounding Strategies
EMNLP 2020
Retouchdown: Releasing Touchdown on StreetLearn as a Public Resource for Language Grounding Tasks in Street View
EMNLP 2020
Spatial Action Maps for Mobile Manipulation
RSS 2020
Multimodal Sign Language Recognition via Temporal Deformable Convolutional Sequence Learning
INTERSPEECH 2020
Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks
INTERSPEECH 2020
Fine-grained Image Classification and Retrieval by Combining Visual and Locally Pooled Textual Features
WACV 2020
Answering Questions about Data Visualizations using Efficient Bimodal Fusion
WACV 2020
Exploring Hate Speech Detection in Multimodal Publications
WACV 2020
Bi-Directional Relationship Inferring Network for Referring Image Segmentation
CVPR 2020