A Diagnostic Framework for Auditing Reference-Free Vision-Language Metrics

Angeline Charles; Srikant Panda; Amit Agarwal; Hitesh Laxmichand Patel; Priyaranjan Pattnayak; Bhargava Kumar; Tejaswini Kumar

AACL 2025

Abstract

Reference-free metrics such as CLIPScore and PAC-S are increasingly used in vision-language tasks due to their scalability and independence from human-written references. However, their reliability under linguistic, visual, and cultural variation remains underexplored. In this work, we present a systematic audit of CLIPScore and PAC-S using an eight-factor diagnostic framework applied to MS-COCO validation images. Our analysis reveals consistent failure modes across dimensions including object size, content category, syntax, named entities, spatial relations, and cultural context. Both metrics penalize captions referencing African (−5.5%, −4.8%) and Arabian (−4.9%, −5.3%) cultures, favor large-object and animal-centric scenes (by 20-30%), and show limited sensitivity to spatial negation and word order. CLIPScore correlates more strongly with syntactic complexity, while PAC-S demonstrates greater robustness to verbosity and named-entity variation, highlighting complementary strengths rather than the superiority of either metric. These findings expose cultural and content bias, weak semantic robustness, and limited compositional understanding. We conclude with design recommendations to improve fairness, scale invariance, and semantic grounding in future reference-free evaluation metrics.
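To make the percentage penalties above concrete, the following is a minimal sketch of how a CLIPScore-style value and a relative score change could be computed. It assumes the standard CLIPScore definition from the original CLIPScore work (a rescaled, clipped cosine similarity with weight w = 2.5). The embedding vectors are toy placeholders, not real CLIP outputs, and the helper `relative_change_pct` is a hypothetical illustration of one plausible way to express a score shift as a percentage; the abstract does not state how the paper's percentages were derived.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """CLIPScore-style value: w * max(cos(image, caption), 0).

    w = 2.5 follows the original CLIPScore definition; the embeddings
    are assumed to come from a CLIP image/text encoder.
    """
    return w * max(cosine(image_emb, caption_emb), 0.0)


def relative_change_pct(baseline, perturbed):
    """Hypothetical helper: percent change of a score against a baseline."""
    return 100.0 * (perturbed - baseline) / baseline


# Toy embeddings (assumed values, for illustration only).
image = [0.6, 0.8, 0.0]
neutral_caption = [0.6, 0.8, 0.0]
variant_caption = [0.5, 0.8, 0.3]

base = clip_score(image, neutral_caption)
variant = clip_score(image, variant_caption)
print(f"baseline={base:.3f} variant={variant:.3f} "
      f"change={relative_change_pct(base, variant):+.1f}%")
```

A negative change for a caption variant that describes the same image equally well is the kind of signal the audit treats as evidence of bias.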

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Machine Learning > Application Areas > Fairness

Keywords

semantic robustness, vision-language model, cultural bias, reference-free metrics

Related papers

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems 2025

Enhancing Training Data Quality through Influence Scores for Generalizable Classification: A Case Study on Sexism Detection 2025

CtrlShift: Steering Language Models for Dense Quotation Retrieval with Dynamic Prompts 2025

Small Changes, Large Consequences: Analyzing the Allocational Fairness of LLMs in Hiring Contexts 2025