Learning To Name Classes for Vision and Language Models

Sarah Parisot; Yongxin Yang; Steven McDonagh

CVPR 2023

Abstract

Large-scale vision and language models can achieve impressive zero-shot recognition performance by mapping class-specific text queries to image content. Two distinct challenges remain, however: high sensitivity to the choice of handcrafted class names that define queries, and the difficulty of adapting to new, smaller datasets. To address these problems, we propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content. By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names. We show that our solution can easily be integrated into image classification and object detection pipelines, yields significant performance gains in multiple scenarios, and provides insights into model biases and labelling errors.
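The core idea in the abstract, learning one word embedding per class while every other parameter stays frozen, can be illustrated with a toy sketch. This is not the authors' implementation: the linear `W_TEXT` "text encoder", the `CONTEXT` vector, the hand-picked image features and all function names are hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
import math

# Frozen stand-ins (hypothetical values): a linear "text encoder" and a prompt context.
W_TEXT = [
    [1.0, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [0.5, 0.0, 0.0, 1.0],
]
CONTEXT = [0.1, 0.1, 0.1, 0.1]
D_FEAT, D_WORD = len(W_TEXT), len(W_TEXT[0])


def encode_text(class_word):
    """Frozen toy text encoder: linear map of (context + class word embedding)."""
    token = [c + w for c, w in zip(CONTEXT, class_word)]
    return [sum(row[j] * token[j] for j in range(D_WORD)) for row in W_TEXT]


def class_probs(image_feat, class_words):
    """Softmax over image-text dot products, one text query per class."""
    logits = [sum(a * b for a, b in zip(image_feat, encode_text(w)))
              for w in class_words]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_class_words(images, labels, class_words, lr=0.5, epochs=100):
    """Gradient descent on cross-entropy; only the class word embeddings change."""
    for _ in range(epochs):
        grads = [[0.0] * D_WORD for _ in class_words]
        for x, y in zip(images, labels):
            probs = class_probs(x, class_words)
            # d logit_c / d word_c = W_TEXT^T x, because the encoder is linear and frozen.
            wtx = [sum(W_TEXT[i][j] * x[i] for i in range(D_FEAT))
                   for j in range(D_WORD)]
            for c in range(len(class_words)):
                coeff = probs[c] - (1.0 if c == y else 0.0)
                for j in range(D_WORD):
                    grads[c][j] += coeff * wtx[j] / len(images)
        class_words = [[w - lr * g for w, g in zip(word, grad)]
                       for word, grad in zip(class_words, grads)]
    return class_words


def accuracy(images, labels, class_words):
    """Fraction of images whose highest-probability class matches the label."""
    hits = 0
    for x, y in zip(images, labels):
        probs = class_probs(x, class_words)
        hits += int(probs.index(max(probs)) == y)
    return hits / len(images)


# Toy "image features" from a frozen image encoder (hypothetical values).
IMAGES = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.9, 0.1]]
LABELS = [0, 0, 1, 1]

# Start from uninformative class names (all-zero word embeddings).
initial_words = [[0.0] * D_WORD for _ in range(2)]
learned_words = train_class_words(IMAGES, LABELS, initial_words)
```

Because the update touches only `class_words`, the encoder weights are identical before and after training, which is the property the abstract relies on to retain zero-shot behaviour for classes whose names are not learned. A real pipeline would replace the toy encoder and features with a pretrained vision and language model.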

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Embedding Learning
Machine Learning > Learning Types > Zero-Shot Learning
Computer Vision > Analysis > Object Detection
Natural Language Processing > Resources & Methods > Large Language Models
Natural Language Processing > Resources & Methods > Text Representation
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Learning Types > Zero-Shot Learning
Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

image classification, zero-shot learning, object detection, text representation, vision-language model, word embedding, text query, zero-shot recognition, class name learning, class naming, text-to-visual alignment
