Adapting Vision-Language Models for E-commerce Understanding at Scale

Matteo Nulli; Orshulevich Vladimir; Tala Bazazo; Christian Herold; Michael Kozielski; Marcin Mazur; Szymon Tuzel; Cees G. M. Snoek; Seyyed Hadi Hashemi; Omar Javed; Yannick Versley; Shahram Khadivi

EACL 2026

Abstract

E-commerce product understanding by nature demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision–Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no well-documented, widely known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Application Areas > Domain Adaptation

Keywords

domain adaptation; multimodal learning; attribute extraction; vision-language model; product understanding
