Learning from Scoring Disagreements: Contrastive Error Mining for Efficient and Robust LLM-based Assessment
Abstract
Abstract Automated grading of student responses still faces numerous challenges, particularly when dealing with complex and ambiguous answers. In particular, large models are prone to scoring bias when handling uncertain responses, and few-shot reasoning methods often lack stability, which limits their applicability in real educational scenarios. To tackle these challenges, we propose the Contrastive Error Mining and FineTuning (CEM-FT) framework, which automatically identifies high-value hard samples by analyzing scoring disagreements between a full fine-tuned model and a few-shot model. A lightweight LoRA adapter is then trained on these samples to refine model performance with minimal computational overhead. Experiments on the SciEntsbank, Beetle, and Mohler datasets show that CEM-FT can improve QWK by up to 3.9% compared to the fine-tuned Qwen model on SciEntsbank datasets, which is a significant improvement over the few-shot baseline. The proposed framework substantially enhances both scoring accuracy and consistency, providing a practical, robust solution for reliable automated assessment with large language models.