Language Model-Driven Data Pruning Enables Efficient Active Learning

Abdul Hameed Azeemi; Ihsan Ayyub Qazi; Agha Ali Raza

2026 EACL EACL 2026

Language Model-Driven Data Pruning Enables Efficient Active Learning

Abstract

AbstractActive learning (AL) optimizes data labeling efficiency by selecting the most informative instances for annotation. However, scaling active learning to large datasets remains a critical challenge, as AL acquisition functions incur prohibitive computational costs when evaluating large unlabeled data pools. To bridge this gap, we introduce a novel plug-and-play data pruning strategy, ActivePrune, which leverages language models to prune the unlabeled pool. ActivePrune implements a two-stage pruning process: an initial fast evaluation using perplexity scores from an n-gram language model, followed by a high-quality selection using metrics for data quality computed through a quantized LLM. To enhance the diversity of the unlabeled pool, we propose a novel perplexity reweighting method that systematically brings forward underrepresented instances for selection. Experiments on translation, sentiment analysis, topic classification, and summarization tasks on diverse datasets and AL strategies demonstrate that ActivePrune outperforms existing data pruning methods. Finally, we compare the selection quality ↔ efficiency tradeoff of the data pruning methods and show that ActivePrune provides up to 74% reduction in the end-to-end AL time compared to other LLM score-based pruning methods.

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Abdul Hameed Azeemi , Ihsan Ayyub Qazi , Agha Ali Raza

Topics

Machine Learning > Learning Types > Active Learning Machine Learning > Application Areas > Efficient Computing

Keywords

active learning knowledge distillation language model data pruning perplexity score

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026