PALO: A Polyglot Large Multimodal Model for 5B People

Hanoona Rasheed; Muhammad Maaz; Abdelrahman Shaker; Salman Khan; Hisham Cholakkal; Rao M. Anwer; Tim Baldwin; Michael Felsberg; Fahad S. Khan

WACV 2025

Abstract

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called PALO. PALO offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, which together span a total of 5B people (65% of the world population). Our method uses a semi-automated translation approach to adapt the multimodal instruction dataset from English to the target languages using a fine-tuned Large Language Model, thereby ensuring high linguistic fidelity while allowing scalability due to minimal manual effort. The incorporation of diverse instruction sets helps us boost overall performance across multiple languages, especially those that are underrepresented, such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark for forthcoming approaches to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.

Topics

Artificial Intelligence > Core AI > Foundation Models; Artificial Intelligence > Core AI > Multimodal Learning; Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

machine translation; instruction tuning; vision-language model; model scaling
