Revisiting Vision-Language Foundations for No-Reference Image Quality Assessment

Ankit Yadav; Ta Duc Huy; Lingqiao Liu

WACV 2026

Abstract

Large-scale vision-language pre-training has recently shown promise for no-reference image quality assessment (NR-IQA), yet the relative merits of modern Vision Transformer foundations remain poorly understood. In this work, we present the first systematic evaluation of six prominent pretrained vision-language backbones (CLIP, SigLIP2, DINOv2, DINOv3, Perception, and ResNet) for NR-IQA, each finetuned using an identical lightweight MLP head. Our study uncovers two previously overlooked factors: (1) SigLIP2 consistently achieves strong performance; and (2) the choice of activation function plays a surprisingly crucial role, particularly for enhancing the generalization ability of image quality assessment models. Notably, we find that simple sigmoid activations outperform the commonly used ReLU and GELU on several benchmarks. Motivated by this finding, we introduce a learnable activation selection mechanism that adaptively determines the nonlinearity for each channel, eliminating the need for manual activation design and achieving new state-of-the-art SRCC on CLIVE, KADID, and AGIQA-3K. Extensive ablations confirm the benefits across architectures and regimes, establishing strong, resource-efficient NR-IQA baselines.
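The abstract does not say how the learnable activation selection is parameterized. One plausible reading, offered here only as an illustrative sketch and not as the authors' method, is a per-channel softmax over a small set of candidate nonlinearities (sigmoid, ReLU, GELU), where each channel owns a vector of selection logits that would be trained jointly with the MLP head. The names `CANDIDATES`, `mixed_activation`, and `selection_logits` are hypothetical, and only the forward pass is shown, in plain Python, to keep the example self-contained.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def relu(x):
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Assumed candidate set; the abstract names these three activations.
CANDIDATES = (sigmoid, relu, gelu)

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_activation(features, selection_logits):
    """Apply a per-channel soft mixture of candidate activations.

    features: one pre-activation value per channel.
    selection_logits: one list of len(CANDIDATES) logits per channel
        (hypothetical learnable parameters, fixed here for illustration).
    """
    if len(features) != len(selection_logits):
        raise ValueError("need one logit vector per channel")
    outputs = []
    for x, logits in zip(features, selection_logits):
        weights = softmax(logits)
        outputs.append(sum(w * f(x) for w, f in zip(weights, CANDIDATES)))
    return outputs

# Channel 0 weights all candidates equally, channel 1 has effectively
# selected sigmoid, and channel 2 has effectively selected ReLU.
example = mixed_activation(
    [0.0, 2.0, -1.0],
    [[0.0, 0.0, 0.0], [50.0, -50.0, -50.0], [-50.0, 50.0, -50.0]],
)
```

Under this assumption, a channel whose logits are dominated by one entry behaves like that single activation, while near-uniform logits give a blend. In a real training setup the logits would be parameters of a deep learning framework module updated by gradient descent, which this sketch omits.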
