Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research

Yida Mu; Mali Jin; Xingyi Song; Nikolaos Aletras

EMNLP 2024

Abstract

Research in natural language processing (NLP) for Computational Social Science (CSS) heavily relies on data from social media platforms. This data plays a crucial role in the development of models for analysing socio-linguistic phenomena within online communities. In this work, we conduct an in-depth examination of 20 datasets extensively used in NLP for CSS to comprehensively examine data quality. Our analysis reveals that social media datasets exhibit varying levels of data duplication. Consequently, this gives rise to challenges like label inconsistencies and data leakage, compromising the reliability of models. Our findings also suggest that data duplication has an impact on the current claims of state-of-the-art performance, potentially leading to an overestimation of model effectiveness in real-world scenarios. Finally, we propose new protocols and best practices for improving dataset development from social media data and its usage.
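To make the three issues named in the abstract concrete (duplication, label inconsistency, and data leakage), the following is a minimal sketch of exact-match checks on a labelled dataset. It is not the authors' procedure, which the abstract does not specify: the normalization rule (lowercasing and whitespace collapsing), the function names, and the toy `train` and `test` examples are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def normalize(text):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())

def label_conflicts(examples):
    """Map each normalized text that carries more than one label to its labels."""
    labels = defaultdict(set)
    for text, label in examples:
        labels[normalize(text)].add(label)
    return {t: ls for t, ls in labels.items() if len(ls) > 1}

def leaked(train, test):
    """Return normalized test texts that also occur in the training split."""
    train_keys = {normalize(t) for t, _ in train}
    return {normalize(t) for t, _ in test} & train_keys

def deduplicate(examples):
    """Keep only the first occurrence of each normalized text."""
    seen, kept = set(), []
    for text, label in examples:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append((text, label))
    return kept

# Hypothetical toy data, not taken from any of the 20 datasets.
train = [
    ("Great game tonight!", "pos"),
    ("great  game tonight!", "neg"),   # duplicate text, conflicting label
    ("Traffic is awful", "neg"),
]
test = [
    ("Traffic is awful", "neg"),       # also present in train: leakage
    ("New phone arrived", "pos"),
]

print(label_conflicts(train))
print(leaked(train, test))
print(len(deduplicate(train)))
```

Exact matching after light normalization only catches verbatim or near-verbatim copies; retweets with added prefixes or lightly edited reposts would need a fuzzier comparison, which this sketch does not attempt.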

Topics

Artificial Intelligence > Core AI > Responsible AI
Machine Learning > Core Methods > Classification
Machine Learning > Application Areas > Data Augmentation
Natural Language Processing > Applications > Text Processing
Data Science & Analytics > Applications > Social Media Analysis

Keywords

social media analysis, data leakage, data quality, dataset quality, computational social science, data deduplication, label inconsistency
