Enhancing Vision Language Corruption Robustness using Cross-Distribution & Prompted Denoisers

Sameer Shafayet Latif; Sadab Shiper; K. M. Rahiduzzaman Kiran; Md Farhan Ishmam; Md Azam Hossain; Abu Raihan Mostofa Kamal; Md Hamjajul Ashmafee

WACV 2026

Abstract

While current Vision Language Models (VLMs) excel under ideal conditions, their performance drops significantly when exposed to realistic multimodal corruptions, such as blurry images and grammatically incorrect text. Our work addresses this by establishing a novel multimodal corruption and denoising benchmark, VLSRB, with a rich suite of 18 visual and 18 textual corruption functions, to evaluate the system robustness of VLMs. To enhance robustness, we employ: (i) cross-distribution visual denoisers, inspired by the Mixture of Experts (MoE) architecture, and (ii) a prompted zero-shot textual denoiser. Our experiments show an overall accuracy gain of up to 5.5% and reveal both the vulnerability of models to specific corruptions and their over-reliance on the textual modality. We envision that the behavioral insights from our benchmark will help in developing robust VLM systems. Our code is available at: https://github.com/farhanishmam/VLMDenoising.
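To make the two denoising components more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes top-1 routing over a small set of visual denoising experts (one plausible reading of "inspired by MoE") and a plain instruction prompt for the zero-shot textual denoiser. The gate, the experts, the prompt wording, and all function names (`toy_gate`, `smooth`, `identity`, `denoise_image`, `build_text_denoise_prompt`) are hypothetical, and images are represented as flat lists of floats in [0, 1] purely for illustration.

```python
from typing import Callable, Dict, List, Tuple

Image = List[float]  # toy stand-in for real pixel data, values in [0, 1]


def smooth(image: Image) -> Image:
    """Hypothetical 'noise' expert: 3-tap moving average, truncated at the edges."""
    n = len(image)
    out = []
    for i in range(n):
        window = image[max(0, i - 1): min(n, i + 2)]
        out.append(sum(window) / len(window))
    return out


def identity(image: Image) -> Image:
    """Hypothetical 'clean' expert: returns a copy of the input unchanged."""
    return list(image)


def toy_gate(image: Image) -> Dict[str, float]:
    """Hypothetical gate: scores 'noise' by mean absolute neighbor difference."""
    if len(image) < 2:
        return {"noise": 0.0, "clean": 1.0}
    rough = sum(abs(a - b) for a, b in zip(image, image[1:])) / (len(image) - 1)
    return {"noise": rough, "clean": 1.0 - rough}


def denoise_image(
    image: Image,
    gate: Callable[[Image], Dict[str, float]],
    experts: Dict[str, Callable[[Image], Image]],
) -> Tuple[str, Image]:
    """Top-1 routing (an assumption): send the image to the highest-scoring expert."""
    scores = gate(image)
    name = max(scores, key=lambda k: scores[k])
    return name, experts[name](image)


def build_text_denoise_prompt(corrupted: str) -> str:
    """Assumed prompt template for a zero-shot textual denoiser (wording is illustrative)."""
    return (
        "Rewrite the following question with correct spelling and grammar. "
        "Do not change its meaning and do not answer it.\n"
        f"Question: {corrupted}\n"
        "Corrected question:"
    )


EXPERTS = {"noise": smooth, "clean": identity}
```

In this sketch, a rapidly alternating input such as `[0.0, 1.0, 0.0, 1.0]` is routed to the smoothing expert, while a constant input is passed through unchanged. The prompt string would be sent to whichever language model serves as the textual denoiser, which the abstract does not specify.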
