COLING 2025

CHIFRAUD: A Long-term Web Text Dataset for Chinese Fraud Detection

Abstract

Detecting fraudulent online text is essential, as these manipulative messages exploit human greed, deceive individuals, and endanger societal security. Currently, this task remains under-explored on the Chinese web due to the lack of a comprehensive dataset of Chinese fraudulent texts. However, creating such a dataset is challenging because it requires extensive annotation within a vast collection of normal texts. Additionally, the creators of fraudulent webpages continuously update their tactics to evade detection by downstream platforms and promote fraudulent messages. To this end, this work first presents a comprehensive long-term dataset of Chinese fraudulent texts collected over 12 months, consisting of 59,106 entries extracted from billions of web pages. Furthermore, we design and provide a wide range of baselines, including large language model-based detectors and pre-trained language model approaches. The dataset and benchmark code needed for further research are available at https://github.com/xuemingxxx/ChiFraud.
