Enhancing Vision-Language Pre-training with Rich Supervisions

Yuan Gao; Kunyu Shi; Pengkai Zhu; Edouard Belval; Oren Nuriel; Srikar Appalaraju; Shabnam Ghadar; Zhuowen Tu; Vijay Mahadevan; Stefano Soatto

2024 CVPR CVPR 2024

Enhancing Vision-Language Pre-training with Rich Supervisions

Abstract

We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present in using image-text pairs. In S4 we leverage the inherent tree-structured hierarchy of HTML elements and the spatial localization to carefully design 10 pre-training tasks with large scale annotated data. These tasks resemble downstream tasks across different domains and the annotations are cheap to obtain. We demonstrate that compared to current screenshot pre-training objectives our innovative pre-training method significantly enhances performance of image-to-text model in nine varied and popular downstream tasks - up to 76.1% improvements on Table Detection and at least 1% on Widget Captioning.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — widget captioning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yuan Gao , Kunyu Shi , Pengkai Zhu , Edouard Belval , Oren Nuriel , Srikar Appalaraju , Shabnam Ghadar , Zhuowen Tu , Vijay Mahadevan , Stefano Soatto

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Machine Learning > Application Areas > Data Augmentation Deep Learning > Techniques > Pretraining Machine Learning > Learning Types > Transfer Learning Computer Vision > Domain-Specific > Document Analysis Deep Learning > Models > Foundation Models Deep Learning > Learning Types > Multi-Modal Learning Deep Learning > Models > Vision-Language Models

Keywords

multi-task learning spatial localization hierarchical structure vision-language model multimodal pretraining table detection screenshot understanding image-to-text model widget captioning

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024