Research Explorer
Vision-Language Models
685 directly classified papers
Papers per year
2015: 1
2016: 1
2017: 3
2018: 1
2019: 7
2020: 12
2021: 26
2022: 57
2023: 94
2024: 235
2025: 248
Papers
Improving CLIP Training with Language Rewrites
NIPS 2023
Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation
NIPS 2023
Quilt-1M: One Million Image-Text Pairs for Histopathology
NIPS 2023
CLIP4HOI: Towards Adapting CLIP for Practical Zero-Shot HOI Detection
NIPS 2023
Cola: A Benchmark for Compositional Text-to-image Retrieval
NIPS 2023
Test-Time Distribution Normalization for Contrastively Learned Visual-language Models
NIPS 2023
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
NIPS 2023
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
NIPS 2023
LAVIS: A One-stop Library for Language-Vision Intelligence
ACL 2023
Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models
ACL 2023
MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models
ACL 2023
A Neural Divide-and-Conquer Reasoning Framework for Image Retrieval from Linguistically Complex Text
ACL 2023
Character-Aware Models Improve Visual Text Rendering
ACL 2023
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
ACL 2023
MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering
ACL 2023
A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues
ACL 2023
Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training
ACL 2023
Measuring Progress in Fine-grained Vision-and-Language Understanding
ACL 2023
ROME: Evaluating Pre-trained Vision-Language Models on Reasoning beyond Visual Common Sense
EMNLP 2023
Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment
EMNLP 2023
NLIP: Noise-Robust Language-Image Pre-training
AAAI 2023
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
ACL 2023
VIPHY: Probing “Visible” Physical Commonsense Knowledge
EMNLP 2023
MM-Reasoner: A Multi-Modal Knowledge-Aware Framework for Knowledge-Based Visual Question Answering
EMNLP 2023
ECHo: A Visio-Linguistic Dataset for Event Causality Inference via Human-Centric Reasoning
EMNLP 2023