Beyond Averages: Learning with Annotator Disagreement in STS

Alejandro Benito-Santos; Adrian Ghajari

EMNLP 2025

Abstract

This work investigates capturing and modeling disagreement in Semantic Textual Similarity (STS), where sentence pairs are assigned ordinal similarity labels (0–5). Conventional STS systems average multiple annotator scores and focus on a single numeric estimate, overlooking label dispersion. By leveraging the disaggregated SemEval-2015 dataset (Soft-STS-15), this paper proposes and compares two disagreement-aware strategies that treat STS as an ordinal distribution prediction problem: a lightweight truncated Gaussian head for standard regression models, and a cross-encoder trained with a distance-aware objective, refined with temperature scaling. Results show improved performance in distance-based metrics, with the calibrated soft-label model proving best overall and notably more accurate on the most ambiguous pairs. This demonstrates that modeling disagreement benefits both calibration and ranking accuracy, highlighting the value of retaining and modeling full annotation distributions rather than collapsing them to a single mean label.
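The abstract names three ingredients without giving formulas: a truncated Gaussian head, a distance-aware comparison between ordinal distributions, and temperature scaling. The following is a minimal sketch of how each could look, not the authors' implementation. It assumes that the Gaussian is discretized into unit-width bins centered on the labels 0 to 5 and renormalized over [-0.5, 5.5], and that the 1-Wasserstein (earth mover's) distance stands in for the unspecified distance-based metric. All function names are hypothetical.

```python
import math

NUM_LABELS = 6  # ordinal similarity labels 0..5


def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def truncated_gaussian_probs(mu, sigma):
    """Turn a regression mean and spread into a distribution over labels 0..5.

    Assumption: label k receives the Gaussian mass on [k - 0.5, k + 0.5],
    and the masses are renormalized so that they sum to 1 (truncation).
    """
    masses = [
        normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)
        for k in range(NUM_LABELS)
    ]
    total = sum(masses)
    return [m / total for m in masses]


def empirical_distribution(scores):
    """Soft label: the fraction of annotators who chose each label."""
    counts = [0] * NUM_LABELS
    for s in scores:
        counts[s] += 1
    return [c / len(scores) for c in counts]


def wasserstein_1(p, q):
    """1-Wasserstein distance on ordered labels: sum of |CDF differences|."""
    dist, cum = 0.0, 0.0
    for p_k, q_k in zip(p, q):
        cum += p_k - q_k
        dist += abs(cum)
    return dist


def temperature_softmax(logits, temperature):
    """Softmax over logits divided by a temperature (T > 1 flattens)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Five hypothetical annotators rate one sentence pair.
target = empirical_distribution([1, 2, 2, 3, 2])
predicted = truncated_gaussian_probs(mu=2.0, sigma=0.7)
print(wasserstein_1(predicted, target))
```

Averaging the five scores above would give a single label of 2.0 and discard the fact that two annotators disagreed; the soft label keeps that dispersion, and a distance that respects label order penalizes a prediction of 5 more than a prediction of 3.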

Topics

Machine Learning > Core Methods > Regression
Machine Learning > Core Methods > Representation Learning
Machine Learning > Optimization & Theory > Stochastic Processes
Natural Language Processing > Applications > Natural Language Inference
Machine Learning > Learning Types > Calibration

Keywords

ordinal regression, temperature scaling, semantic textual similarity, annotator disagreement, soft-label learning, ordinal distribution prediction, distance-aware objective, label dispersion
