Learning Visual Composition through Improved Semantic Guidance

Austin Stone; Hagen Soltau; Robert Geirhos; Xi Yi; Ye Xia; Bingyi Cao; Kaifeng Chen; Abhijit Ogale; Jonathon Shlens

CVPR 2025

Abstract

Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning, where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation through the development of bespoke architectures. In this work, we focus on simple and scalable approaches. In particular, we demonstrate that by improving weakly labeled data, i.e., captions, we can vastly improve the performance of standard contrastive learning approaches. Previous CLIP models achieved near chance rate on challenging tasks probing compositional learning. However, our simple approach boosts performance of CLIP substantially and achieves state-of-the-art results on compositional benchmarks such as ARO and SugarCrepe. Furthermore, we showcase our results on a relatively new captioning benchmark derived from DOCCI. We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may demonstrate impressive performance on image retrieval tasks.
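To make the "bag of words" limitation concrete, the following is a minimal sketch and not the authors' method or evaluation code. It assumes a toy text representation that only counts words and shows that such a representation cannot tell apart two captions that differ only in which object plays which role. The captions and the helper names `bag_of_words` and `ordered_tokens` are illustrative assumptions, not taken from the paper or its benchmarks.

```python
from collections import Counter


def bag_of_words(caption: str) -> Counter:
    """Order-free representation: only word counts survive."""
    return Counter(caption.lower().split())


def ordered_tokens(caption: str) -> tuple:
    """Order-aware representation: the token sequence is preserved."""
    return tuple(caption.lower().split())


# Hypothetical caption pair: same words, opposite relation between objects.
correct = "the horse is eating the grass"
swapped = "the grass is eating the horse"

# A bag-of-words view sees identical captions...
same_bag = bag_of_words(correct) == bag_of_words(swapped)

# ...while an order-aware view can distinguish them.
same_sequence = ordered_tokens(correct) == ordered_tokens(swapped)

print("identical as bags of words:", same_bag)
print("identical as ordered sequences:", same_sequence)
```

A model whose text-image matching behaves like the first representation would score both captions equally against the same image, which is consistent with the near-chance results the abstract reports for earlier CLIP models on compositional tasks.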

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Contrastive Learning
Machine Learning > Learning Types > Weakly Supervised Learning

Keywords

contrastive learning, image retrieval, visual representation, compositional learning, foundation model, semantic guidance
