Research Explorer
Multimodal Learning
1257 directly classified papers
Papers per year
2008: 1
2009: 2
2010: 2
2011: 1
2012: 3
2013: 3
2014: 2
2015: 5
2017: 11
2018: 25
2019: 33
2020: 66
2021: 47
2022: 113
2023: 199
2024: 325
2025: 411
2026: 8
Papers
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
EMNLP 2022
GHAN: Graph-Based Hierarchical Aggregation Network for Text-Video Retrieval
EMNLP 2022
Extending Phrase Grounding with Pronouns in Visual Dialogues
EMNLP 2022
Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering
EMNLP 2022
Ring That Bell: A Corpus and Method for Multimodal Metaphor Detection in Videos
EMNLP 2022
Do Decoding Algorithms Capture Discourse Structure in Multi-Modal Tasks? A Case Study of Image Paragraph Generation
EMNLP 2022
Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features
EMNLP 2022
Divert More Attention to Vision-Language Tracking
NIPS 2022
One-Shot Talking Face Generation from Single-Speaker Audio-Visual Correlation Learning
AAAI 2022
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
CVPR 2022
Neural Collaborative Graph Machines for Table Structure Recognition
CVPR 2022
Learning Affordance Grounding From Exocentric Images
CVPR 2022
Fine-Grained Semantically Aligned Vision-Language Pre-Training
NIPS 2022
MACK: Multimodal Aligned Conceptual Knowledge for Unpaired Image-text Matching
NIPS 2022
Cross-Modal Object Tracking: Modality-Aware Representations and a Unified Benchmark
AAAI 2022
A Multimodal Fusion-Based LNG Detection for Monitoring Energy Facilities (Student Abstract)
AAAI 2022
Bridging the Gap between Expression and Scene Text for Referring Expression Comprehension (Student Abstract)
AAAI 2022
Cross-Modal Mutual Learning for Audio-Visual Speech Recognition and Manipulation
AAAI 2022
Edge-Aware Guidance Fusion Network for RGB–Thermal Scene Parsing
AAAI 2022
SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning
AAAI 2022
MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding
AAAI 2022
Weakly-Supervised Generation and Grounding of Visual Descriptions With Conditional Generative Models
CVPR 2022
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
IJCNLP 2022
DAVIS: Driver’s Audio-Visual Speech recognition
INTERSPEECH 2022
GraDual: Graph-Based Dual-Modal Representation for Image-Text Matching
WACV 2022