Research Explorer
Vision-Language Models
685 directly classified papers
Papers per year
2015: 1
2016: 1
2017: 3
2018: 1
2019: 7
2020: 12
2021: 26
2022: 57
2023: 94
2024: 235
2025: 248
Papers
ImageBind: One Embedding Space To Bind Them All
CVPR 2023
Detecting Everything in the Open World: Towards Universal Object Detection
CVPR 2023
CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset
CVPR 2023
Doubly Right Object Recognition: A Why Prompt for Visual Rationales
CVPR 2023
CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching
CVPR 2023
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
CVPR 2023
GeneCIS: A Benchmark for General Conditional Image Similarity
CVPR 2023
A-Cap: Anticipation Captioning With Commonsense Knowledge
CVPR 2023
Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce
CVPR 2023
AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning
CVPR 2023
3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions
CVPR 2023
Revisiting Temporal Modeling for CLIP-Based Image-to-Video Knowledge Transferring
CVPR 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
NIPS 2023
VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models
NIPS 2023
LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models
ACL 2023
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
NIPS 2023
Text Promptable Surgical Instrument Segmentation with Vision-Language Models
NIPS 2023
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
NIPS 2023
Three Towers: Flexible Contrastive Learning with Pretrained Image Models
NIPS 2023
Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP
NIPS 2023
Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models
NIPS 2023
LOVM: Language-Only Vision Model Selection
NIPS 2023
Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions
NIPS 2023
Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
NIPS 2023
Visual Instruction Tuning
NIPS 2023