Assessing Crowdsourced Annotations with LLMs: Linguistic Certainty as a Proxy for Trustworthiness

Tianyi Li; Divya Sree; Tatiana Ringenberg

2025 NAACL NAACL 2025

Assessing Crowdsourced Annotations with LLMs: Linguistic Certainty as a Proxy for Trustworthiness

Abstract

AbstractHuman-annotated data is fundamental for training machine learning models, yet crowdsourced annotations often contain noise and bias. In this paper, we investigate the feasibility of employing large language models (LLMs), specifically GPT-4, as evaluators of crowdsourced annotations using a zero-shot prompting strategy. We introduce a certainty-based approach that leverages linguistic cues categorized into five levels (Absolute, High, Moderate, Low, Uncertain) based on Rubin’s framework—to assess the trustworthiness of LLM-generated evaluations. Using the MAVEN dataset as a case study, we compare GPT-4 evaluations against human evaluations and observe that the alignment between LLM and human judgments is strongly correlated with response certainty. Our results indicate that LLMs can effectively serve as a preliminary filter to flag potentially erroneous annotations for further expert review.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Interdisciplinary and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — linguistic certainty

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio