COLING 2024

New Evaluation Methodology for Qualitatively Comparing Classification Models

Abstract

Text classification is one of the most common tasks in Natural Language Processing. When proposing a new classification model, practitioners typically select a sample of items that the proposed model classified correctly and the baseline did not, and then look for patterns across those items to understand the proposed model’s strengths. However, this approach is not comprehensive and requires manual effort to identify patterns across text items. In this work, we propose a new evaluation methodology for qualitatively comparing multiple classification models. The methodology discovers clusters of text items such that, within each cluster, 1) the items exhibit a linguistic pattern and 2) the proposed model significantly outperforms the baseline. This helps practitioners learn what their proposed model captures better than the baseline without having to perform this analysis manually. We compare a fine-tuned BERT model with a Logistic Regression baseline, using Sentiment Analysis as the downstream task. We show that the proposed methodology discovers various clusters of text items that BERT classifies significantly more accurately than the Logistic Regression baseline, thus providing insight into what BERT is powerful at capturing.
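The selection criterion described in the abstract (clusters in which the proposed model significantly outperforms the baseline) can be illustrated with a small sketch. The abstract does not say how clusters are formed or which significance test is used, so everything here is an assumption: cluster assignments are taken as given, a one-sided exact sign test over the items on which the two models disagree stands in for the significance check, and the names `sign_test_p`, `clusters_where_proposed_wins`, and `alpha` are hypothetical.

```python
from collections import defaultdict
from math import comb


def sign_test_p(wins, losses):
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(wins + losses, 0.5).

    Assumption: this test is a stand-in; the abstract does not name a test.
    """
    n = wins + losses
    if n == 0:
        return 1.0  # no disagreements, so no evidence either way
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def clusters_where_proposed_wins(cluster_ids, proposed_correct, baseline_correct, alpha=0.05):
    """Return {cluster_id: p_value} for clusters where the proposed model
    is significantly more accurate than the baseline.

    cluster_ids      -- one (hypothetical, precomputed) cluster label per item
    proposed_correct -- bool per item: did the proposed model classify it correctly?
    baseline_correct -- bool per item: did the baseline classify it correctly?
    """
    # cluster -> [items only the proposed model got right, items only the baseline got right]
    counts = defaultdict(lambda: [0, 0])
    for cluster, prop_ok, base_ok in zip(cluster_ids, proposed_correct, baseline_correct):
        if prop_ok and not base_ok:
            counts[cluster][0] += 1
        elif base_ok and not prop_ok:
            counts[cluster][1] += 1

    significant = {}
    for cluster, (wins, losses) in counts.items():
        p_value = sign_test_p(wins, losses)
        if p_value < alpha:
            significant[cluster] = p_value
    return significant


# Toy usage with made-up labels: cluster "neg" favors the proposed model on
# every disagreement, while cluster "other" is evenly split.
ids = ["neg"] * 6 + ["other"] * 4
proposed = [True] * 6 + [True, True, False, False]
baseline = [False] * 6 + [False, False, True, True]
selected = clusters_where_proposed_wins(ids, proposed, baseline)
```

In this toy input only the `"neg"` cluster passes the threshold. A practitioner would then read the items in each selected cluster to name the linguistic pattern they share.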
