FashionVLP: Vision Language Transformer for Fashion Retrieval With Feedback

Sonam Goenka; Zhaoheng Zheng; Ayush Jaiswal; Rakesh Chada; Yue Wu; Varsha Hedau; Pradeep Natarajan

CVPR 2022

Abstract

Fashion image retrieval based on a query pair of reference image and natural language feedback is a challenging task that requires models to assess fashion-related information from visual and textual modalities simultaneously. We propose a new vision-language transformer-based model, FashionVLP, that brings the prior knowledge contained in large image-text corpora to the domain of fashion image retrieval, and combines visual information from multiple levels of context to effectively capture fashion-related information. While queries are encoded through the transformer layers, our asymmetric design adopts a novel attention-based approach for fusing target image features without involving text or transformer layers in the process. Extensive results show that FashionVLP achieves state-of-the-art performance on benchmark datasets, with a large 23% relative improvement on the challenging FashionIQ dataset, which contains complex natural language feedback.
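
The abstract does not give the details of the target-side fusion or the ranking step, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each target image is described by several feature vectors (one per level of context), that a hypothetical learned vector `weight_vector` scores them, and that candidates are ranked by cosine similarity to the query embedding. All function names and the choice of cosine similarity are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(features, weight_vector):
    """Fuse several equal-length feature vectors into one.

    features: list of feature vectors, e.g. one per context level
              (hypothetical: whole image, crop, regions).
    weight_vector: hypothetical learned vector used to score each feature.
    """
    # Score each feature vector with a dot product against weight_vector.
    scores = [sum(w * x for w, x in zip(weight_vector, f)) for f in features]
    # Turn the scores into attention weights that sum to 1.
    alphas = softmax(scores)
    dim = len(features[0])
    # The fused vector is the attention-weighted sum of the inputs.
    return [sum(a * f[i] for a, f in zip(alphas, features)) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_targets(query_embedding, target_embeddings):
    """Return target indices ordered from most to least similar (assumed metric)."""
    sims = [cosine(query_embedding, t) for t in target_embeddings]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
```

In this sketch the target side uses only a lightweight attention pooling, which mirrors the asymmetry described above: no text and no transformer layers are involved in building the target embedding, so target embeddings could in principle be computed once and reused across queries.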

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Deep Learning > Architectures > Transformers; Computer Vision > Analysis > Object Detection; Computer Science > Applications > Information Retrieval; Computer Vision > Core AI > Multimodal Learning

Keywords

multimodal learning; cross-modal retrieval; vision-language transformer; vision language; fashion retrieval; natural language feedback; fashion image retrieval
