Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Soravit Changpinyo; Piyush Sharma; Nan Ding; Radu Soricut

2021 CVPR CVPR 2021

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Abstract

The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overrestrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning

📈 Trend Setter — Foundation Models

🧭 Keyword Pioneer — long-tail recognition

🐣 Hot Topic Early Bird — vision-language pre-training

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Soravit Changpinyo , Piyush Sharma , Nan Ding , Radu Soricut

Topics

Machine Learning > Learning Types > Self-Supervised Learning Deep Learning > Techniques > Pretraining Computer Vision > Generation > Image Captioning Machine Learning > Learning Types > Transfer Learning Deep Learning > Models > Foundation Models Deep Learning > Learning Types > Multi-Modal Learning Computer Vision > Core AI > Multi-Modal Learning

Keywords

visual question answering image captioning image-text pair downstream task vision-language pre-training long-tail recognition vision-language pretraining conceptual caption long-tail visual recognition web-scale dataset

Download PDF

Related papers

Learning To Reconstruct High Speed and High Dynamic Range Videos From Events 2021

DeFLOCNet: Deep Image Editing via Flexible Low-Level Controls 2021

Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs 2021

Coming Down to Earth: Satellite-to-Street View Synthesis for Geo-Localization 2021

Pose-Guided Human Animation From a Single Image in the Wild 2021