SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining

Chull Hwan Song; Taebaek Hwang; Jooyoung Yoon; Shunghyun Choi; Yeong Hyeon Gu

CVPR 2024

Abstract

Vision-language models (VLMs) have made significant strides in cross-modal understanding through large-scale paired datasets. However, in the fashion domain, datasets often exhibit a disparity between the information conveyed in image and text. This issue stems from datasets containing multiple images of a single fashion item, all paired with one text, leading to cases where some textual details are not visible in individual images. This mismatch, particularly when non-co-occurring elements are masked, undermines the training of conventional VLM objectives like Masked Language Modeling and Masked Image Modeling, thereby hindering the model's ability to accurately align fine-grained visual and textual features. To address this problem, we propose Synchronized attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where the information co-occurs in both image and text. This synchronization is accomplished by harnessing cross-attentional features obtained from a momentum model, ensuring a precise alignment between the two modalities. Additionally, we enhance grouped batch sampling with semi-hard negatives, effectively mitigating false negative issues in Image-Text Matching and Image-Text Contrastive learning objectives within fashion datasets. Our experiments demonstrate the effectiveness of the proposed approach, outperforming existing methods in three downstream tasks.
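
The idea of masking only where image and text information co-occur can be illustrated with a minimal sketch. The abstract does not state the exact selection rule, so the top-k scoring below, the function name `sync_mask_indices`, and the toy attention matrix are all assumptions made for illustration, not the authors' implementation. The sketch takes a text-to-image cross-attention matrix (such as one a momentum model might produce), masks the word tokens with the strongest link to some patch, and then masks the patches those same tokens attend to most.

```python
from typing import List, Tuple


def sync_mask_indices(
    cross_attn: List[List[float]],
    num_text_masks: int,
    num_patch_masks: int,
) -> Tuple[List[int], List[int]]:
    """Pick word tokens and image patches that attend strongly to each other.

    cross_attn[i][j] is a hypothetical attention weight from text token i
    to image patch j. The top-k rule used here is an assumption.
    """
    num_tokens = len(cross_attn)
    num_patches = len(cross_attn[0])

    # Score each token by its strongest attention weight to any patch.
    token_scores = [max(row) for row in cross_attn]
    text_idx = sorted(
        range(num_tokens), key=lambda i: token_scores[i], reverse=True
    )[:num_text_masks]

    # Score each patch by attention received from the selected tokens only,
    # so the image mask is tied to the text mask.
    patch_scores = [
        sum(cross_attn[i][j] for i in text_idx) for j in range(num_patches)
    ]
    patch_idx = sorted(
        range(num_patches), key=lambda j: patch_scores[j], reverse=True
    )[:num_patch_masks]

    return sorted(text_idx), sorted(patch_idx)


# Toy example: 3 text tokens x 4 image patches.
attn = [
    [0.25, 0.25, 0.25, 0.25],  # diffuse attention: no clear visual counterpart
    [0.05, 0.80, 0.10, 0.05],  # sharply focused on patch 1
    [0.30, 0.20, 0.30, 0.20],
]
text_idx, patch_idx = sync_mask_indices(attn, num_text_masks=1, num_patch_masks=2)
print(text_idx, patch_idx)  # [1] [1, 2]
```

In this toy case the token with diffuse attention is never masked, which mirrors the motivation above: a word with no visible counterpart in the image is a poor masking target.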

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Contrastive Learning
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Techniques > Attention

Keywords

knowledge distillation, vision-language model, masked language modeling, image-text matching, vision-language pretraining, cross-attentional feature, semi-hard negative
