Aggregating Crowd of LLMs for Cost-Effective Data Annotation

Jiacheng Liu; Xiaofeng Hou

2026 EACL EACL 2026

Aggregating Crowd of LLMs for Cost-Effective Data Annotation

Abstract

AbstractRecent advancements in Large Language Models (LLMs) have shown promise for automated data annotation, yet reliance on expensive commercial models like GPT-4 limits accessibility. This paper rigorously evaluates the potential of open-source smaller LLMs (sLLMs) as a cost-effective alternative. We introduce a new benchmark dataset, Multidisciplinary Open Research Data (MORD), comprising 12,277 annotated sentence segments from 1,500 schoolarly articles across five research domains, to systematically assess sLLM performance. Our experiments demonstrate that sLLMs achieve annotation quality surpassing Amazon MTurk workers and approach GPT-4’s accuracy at significantly lower costs. We further propose to build the Crowd of LLMs, which aggregates annotations from multiple sLLMs using label aggregation algorithms. This approach not only outperforms individual sLLMs but also reveals that combining sLLM annotations with human crowd labels yields superior results compared to either method alone. Our findings highlight the viability of sLLMs for democratizing high-quality data annotation while underscoring the need for tailored aggregation methods to fully realize their potential.

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jiacheng Liu , Xiaofeng Hou

Topics

Natural Language Processing > Resources & Methods > Large Language Models Natural Language Processing > Resources & Methods > Text Representation

Keywords

data annotation label aggregation crowd sourcing large language model

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026