2025 EMNLP EMNLP 2025

LimaCost: Data Valuation for Instruction Tuning of Large Language Models

Abstract

AbstractInstruction tuning (IT) is an effective approach for aligning large language models (LLMs) with human intentions. There is ongoing discourse regarding the data quality for IT. As an effort to find the robust criteria of data quality for IT, we introduce LimaCost, a data quality measure that exhibits a strong correlation with model performance. LimaCost utilizes LIMA dataset, which effectiveness in IT has already been validated by several previous works. LimaCost then estimates the value of a given data by estimating how many LIMA data points might be needed to approximate its gradient. Our experiments reveal that LimaCost enables effective data selection that derive high alignment performance. We demonstrate that selecting data based on high LimaCost proves to be more effective than existing data selection strategies.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Natural Language Processing
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio