Enhancing Vision Language Corruption Robustness using Cross-Distribution & Prompted Denoisers
Abstract
While the current generation of Vision Language Models (VLMs) has excelled in ideal conditions, their performance drops significantly when exposed to realistic multimodal corruptions, such as blurry images and grammatically incorrect text. Our work addresses this by establishing a novel multimodal corruption and denoising benchmark, VLSRB, with a rich suite of 18 visual and 18 textual corruption functions, to evaluate the system robustness of VLMs. To enhance robustness, we employ: (i) cross-distribution visual denoisers, inspired by the Mixture of Experts (MoE) architecture, and (ii) a prompted zero-shot textual denoiser. Our experiments show an overall accuracy gain of up to 5.5% and expose both the vulnerability of models to specific corruptions and their over-reliance on the textual modality. We envision that the behavioral insights from our benchmark will help in developing robust VLM systems. Our code is available at: https://github.com/farhanishmam/VLMDenoising.
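To make the notion of a textual corruption function concrete, the following is a minimal illustrative sketch and not one of the 18 VLSRB functions. The names `swap_adjacent_chars`, `accuracy`, and `accuracy_drop`, as well as the choice of a character-swap corruption, are assumptions introduced here for illustration only; the benchmark's actual corruptions and metrics are defined in the linked repository.

```python
import random


def swap_adjacent_chars(text: str, rate: float, seed: int = 0) -> str:
    """Hypothetical textual corruption: swap adjacent letters with probability `rate`."""
    rng = random.Random(seed)  # local generator so results are reproducible
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)


def accuracy(predictions, labels) -> float:
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def accuracy_drop(clean_acc: float, corrupted_acc: float) -> float:
    """Absolute accuracy lost when moving from clean to corrupted inputs."""
    return clean_acc - corrupted_acc
```

Under these assumptions, a robustness check would run the same model on a clean question and on `swap_adjacent_chars(question, rate)`, then compare the two accuracies with `accuracy_drop`; a denoiser would be judged by how much of that drop it recovers.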