Learning from Scoring Disagreements: Contrastive Error Mining for Efficient and Robust LLM-based Assessment

Lei Chen; Tengteng Cheng; BoYu Gao; Zitao Liu; Weiqi Luo

AAAI 2026

Abstract

Automated grading of student responses remains challenging, particularly for complex and ambiguous answers. Large language models (LLMs) are prone to scoring bias on uncertain responses, and few-shot reasoning methods are often unstable, which limits their use in real educational settings. To address these problems, we propose the Contrastive Error Mining and Fine-Tuning (CEM-FT) framework, which automatically identifies high-value hard samples by analyzing scoring disagreements between a fully fine-tuned model and a few-shot model. A lightweight LoRA adapter is then trained on these samples to refine model performance with minimal computational overhead. Experiments on the SciEntsBank, Beetle, and Mohler datasets show that CEM-FT improves quadratic weighted kappa (QWK) by up to 3.9% over the fine-tuned Qwen model on SciEntsBank, and yields a significant improvement over the few-shot baseline. The framework substantially improves both scoring accuracy and consistency, providing a practical, robust solution for reliable automated assessment with large language models.
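The two ideas the abstract relies on, mining samples where two scorers disagree and measuring agreement with QWK, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact mining criterion, so the sketch assumes the simplest reading (flag a response when the two models' integer scores differ by at least a threshold). The function names `mine_disagreements` and `quadratic_weighted_kappa` and the `min_gap` parameter are hypothetical, and the scores in the usage lines are made-up toy values.

```python
from collections import Counter
from typing import List, Sequence


def mine_disagreements(
    finetuned_scores: Sequence[int],
    fewshot_scores: Sequence[int],
    min_gap: int = 1,
) -> List[int]:
    """Return indices of responses on which the two scorers disagree.

    Assumption: a "hard sample" is any response whose two scores differ
    by at least `min_gap`. The paper may use a different criterion.
    """
    if len(finetuned_scores) != len(fewshot_scores):
        raise ValueError("score lists must have the same length")
    return [
        i
        for i, (a, b) in enumerate(zip(finetuned_scores, fewshot_scores))
        if abs(a - b) >= min_gap
    ]


def quadratic_weighted_kappa(
    y_true: Sequence[int], y_pred: Sequence[int], num_labels: int
) -> float:
    """Quadratic weighted kappa for integer labels in range(num_labels)."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    if num_labels < 2:
        raise ValueError("need at least two labels")

    observed = Counter(zip(y_true, y_pred))  # confusion-matrix counts
    true_hist = Counter(y_true)              # marginal counts, gold labels
    pred_hist = Counter(y_pred)              # marginal counts, predictions
    max_dist = (num_labels - 1) ** 2

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / max_dist
            weighted_observed += weight * observed[(i, j)]
            # Expected count under independence of the two marginals.
            weighted_expected += weight * true_hist[i] * pred_hist[j] / n

    if weighted_expected == 0.0:
        # Both raters used one identical label; treated here as perfect
        # agreement by convention.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Toy usage with made-up scores on a 0-2 scale.
gold = [2, 1, 0, 2]
finetuned = [2, 1, 0, 2]
fewshot = [2, 0, 0, 1]
hard_indices = mine_disagreements(finetuned, fewshot)
print("hard samples:", hard_indices)
print("few-shot QWK:", quadratic_weighted_kappa(gold, fewshot, num_labels=3))
```

In a pipeline of the kind the abstract describes, the responses at `hard_indices` would form the small training set for the LoRA adapter, and QWK would be recomputed on held-out data after adaptation.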

Authors

Lei Chen, Tengteng Cheng, BoYu Gao, Zitao Liu, Weiqi Luo

Topics

Machine Learning > Learning Types > Contrastive Learning

Machine Learning > Application Areas > Efficient Computing

Natural Language Processing > Applications > Machine Reading Comprehension

Keywords

LoRA fine-tuning, automated grading, quadratic weighted kappa, few-shot reasoning, contrastive error mining, scoring disagreement
