ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Chunyuan Li; Haotian Liu; Liunian Li; Pengchuan Zhang; Jyoti Aneja; Jianwei Yang; Ping Jin; Houdong Hu; Zicheng Liu; Yong Jae Lee; Jianfeng Gao

2022 NIPS NeurIPS 2022

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Abstract

Learning visual representations from natural language supervision has recently shown great promise in a number of pioneering works. In general, these language-augmented visual models demonstrate strong transferability to a variety of datasets/tasks. However, it remains challenging to evaluate the transferablity of these foundation models due to the lack of easy-to-use toolkits for fair benchmarking. To tackle this, we build ELEVATER (Evaluation of Language-augmented Visual Task-level Transfer), the first benchmark to compare and evaluate pre-trained language-augmented visual models. Several highlights include: (i) Datasets. As downstream evaluation suites, it consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge. (ii) Toolkit. An automatic hyper-parameter tuning toolkit is developed to ensure the fairness in model adaption. To leverage the full power of language-augmented visual models, novel language-aware initialization methods are proposed to significantly improve the adaption performance. (iii) Metrics. A variety of evaluation metrics are used, including sample-efficiency (zero-shot and few-shot) and parameter-efficiency (linear probing and full model fine-tuning). We will publicly release ELEVATER.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — language-augmented visual model

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Chunyuan Li , Haotian Liu , Liunian Li , Pengchuan Zhang , Jyoti Aneja , Jianwei Yang , Ping Jin , Houdong Hu , Zicheng Liu , Yong Jae Lee , Jianfeng Gao

Topics

Machine Learning > Core Methods > Classification Machine Learning > Learning Types > Zero-Shot Learning Computer Vision > Analysis > Object Detection Computer Vision > Core AI > Multimodal Learning Deep Learning > Learning Types > Transfer Learning Deep Learning > Models > Vision-Language Models

Keywords

zero-shot learning few-shot learning transfer learning object detection vision-language model linear probing language-augmented visual model

Download PDF

Related papers

Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching 2022

A Theoretical View on Sparsely Activated Networks 2022

Prune and distill: similar reformatting of image information along rat visual cortex and deep neural networks 2022

Matryoshka Representation Learning 2022

Off-Policy Evaluation with Deficient Support Using Side Information 2022