Research Explorer
Vision-Language Models
685 directly classified papers
Papers per year
2015: 1
2016: 1
2017: 3
2018: 1
2019: 7
2020: 12
2021: 26
2022: 57
2023: 94
2024: 235
2025: 248
Papers
Scene-Text Aware Image and Text Retrieval with Dual-Encoder
ACL 2022
Voxel-informed Language Grounding
ACL 2022
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
ACL 2022
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension
ACL 2022
Contrastive Visual Semantic Pretraining Magnifies the Semantics of Natural Language Representations
ACL 2022
Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-Modal Knowledge Transfer
ACL 2022
GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection
CVPR 2022
Towards Language-Free Training for Text-to-Image Generation
CVPR 2022
Transform-Retrieve-Generate: Natural Language-Centric Outside-Knowledge Visual Question Answering
CVPR 2022
SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Color Editing
CVPR 2022
Image Segmentation Using Text and Image Prompts
CVPR 2022
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
CVPR 2022
Align and Prompt: Video-and-Language Pre-Training With Entity Prompts
CVPR 2022
Object-Aware Video-Language Pre-Training for Retrieval
CVPR 2022
CLIP-Forge: Towards Zero-Shot Text-To-Shape Generation
CVPR 2022
Make It Move: Controllable Image-to-Video Generation With Text Descriptions
CVPR 2022
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
EMNLP 2022
CPL: Counterfactual Prompt Learning for Vision and Language Models
EMNLP 2022
TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
EMNLP 2022
An Anchor-based Relative Position Embedding Method for Cross-Modal Tasks
EMNLP 2022
Towards Unifying Reference Expression Generation and Comprehension
EMNLP 2022
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
EMNLP 2022
Character-centric Story Visualization via Visual Planning and Token Alignment
EMNLP 2022
Distilled Dual-Encoder Model for Vision-Language Understanding
EMNLP 2022
VisToT: Vision-Augmented Table-to-Text Generation
EMNLP 2022