Scaling Capability in Token Space: An Analysis of Large Vision Language Model

Tenghui Li; Guoxu Zhou; Xuyang Zhao; Qibin Zhao

JMLR 2025

Abstract

Large language models have demonstrated predictable scaling behaviors with respect to model parameters and training data. This study investigates whether a similar scaling relationship exists for vision-language models with respect to the number of vision tokens. A mathematical framework is developed to characterize the relationship between the number of vision tokens and the expected divergence of distance between vision-referencing sequences. The theoretical analysis reveals two distinct scaling regimes: sublinear scaling for fewer vision tokens and linear scaling for more vision tokens. This aligns with model performance relationships of the form \(S(n) \approx c / n^{\alpha(n)}\), where the scaling exponent relates to the correlation structure between vision token representations. Empirical validation across multiple vision-language benchmarks shows that model performance matches the predictions of the scaling relationship. The findings contribute to understanding vision token scaling in transformers through a theoretical framework that complements empirical observations. © JMLR 2025.
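As a rough illustration of the functional form \(S(n) \approx c / n^{\alpha(n)}\) quoted in the abstract, the sketch below evaluates the curve and recovers its parameters with a log-log least-squares fit. It is not the authors' procedure: it assumes a constant exponent (the abstract lets \(\alpha\) depend on \(n\)), and the function names, the token counts, and the values c = 2.0 and alpha = 0.5 are all hypothetical choices made for demonstration on synthetic data.

```python
import math


def scaling_curve(n, c, alpha):
    """Evaluate S(n) = c / n**alpha, treating alpha as constant (an assumption)."""
    return c / n ** alpha


def fit_power_law(ns, ss):
    """Estimate (c, alpha) by least squares on log S = log c - alpha * log n."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(s) for s in ss]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope and intercept in log-log space.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Hypothetical vision token counts and synthetic, noise-free values of S(n).
token_counts = [16, 32, 64, 128, 256, 512]
synthetic = [scaling_curve(n, c=2.0, alpha=0.5) for n in token_counts]

c_hat, alpha_hat = fit_power_law(token_counts, synthetic)
print(f"c = {c_hat:.3f}, alpha = {alpha_hat:.3f}")
```

Because the abstract describes two regimes, a single global fit like this one would only be a first approximation; fitting separate exponents over low and high token-count ranges is one possible way to probe for a regime change.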

Authors

Tenghui Li, Guoxu Zhou, Xuyang Zhao, Qibin Zhao

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Optimization & Theory > Theory
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Optimization & Theory > Theory

Keywords

representation learning, multimodal learning, theoretical analysis, vision language model, scaling law, token representation
