Vision-Language Models

685 directly classified papers

Papers

Scene-Text Aware Image and Text Retrieval with Dual-Encoder ACL 2022

Voxel-informed Language Grounding ACL 2022

VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena ACL 2022

ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension ACL 2022

Contrastive Visual Semantic Pretraining Magnifies the Semantics of Natural Language Representations ACL 2022

Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-Modal Knowledge Transfer ACL 2022

GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection CVPR 2022

Towards Language-Free Training for Text-to-Image Generation CVPR 2022

Transform-Retrieve-Generate: Natural Language-Centric Outside-Knowledge Visual Question Answering CVPR 2022

SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Color Editing CVPR 2022

Image Segmentation Using Text and Image Prompts CVPR 2022

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality CVPR 2022

Align and Prompt: Video-and-Language Pre-Training With Entity Prompts CVPR 2022

Object-Aware Video-Language Pre-Training for Retrieval CVPR 2022

CLIP-Forge: Towards Zero-Shot Text-To-Shape Generation CVPR 2022

Make It Move: Controllable Image-to-Video Generation With Text Descriptions CVPR 2022

Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? EMNLP 2022

CPL: Counterfactual Prompt Learning for Vision and Language Models EMNLP 2022

TRIPS: Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection EMNLP 2022

An Anchor-based Relative Position Embedding Method for Cross-Modal Tasks EMNLP 2022

Towards Unifying Reference Expression Generation and Comprehension EMNLP 2022

mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections EMNLP 2022

Character-centric Story Visualization via Visual Planning and Token Alignment EMNLP 2022

Distilled Dual-Encoder Model for Vision-Language Understanding EMNLP 2022

VisToT: Vision-Augmented Table-to-Text Generation EMNLP 2022