COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

Yuelin Bai; Xeron Du; Yiming Liang; Leo Jin; Junting Zhou; Ziqiang Liu; Feiteng Fang; Mingshan Chang; Tianyu Zheng; Xincheng Zhang; Nuo Ma; Zekun Moore Wang; Ruibin Yuan; Haihong Wu; Hongquan Lin; Wenhao Huang; Jiajun Zhang; Chenghua Lin; Jie Fu; Min Yang; Shiwen Ni; Ge Zhang

NAACL 2025

Abstract

Remarkable progress on large language models (LLMs), particularly in English, has facilitated impressive capabilities in following human instructions. However, there remains a noticeable gap in instruction fine-tuning for Chinese, where the complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well-aligned with Chinese users’ interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world data resources and subjected to comprehensive human verification. We conduct extensive experiments on COIG-CQIA and compare the results against strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance on diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.

Authors

Yuelin Bai, Xeron Du, Yiming Liang, Leo Jin, Junting Zhou, Ziqiang Liu, Feiteng Fang, Mingshan Chang, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Moore Wang, Ruibin Yuan, Haihong Wu, Hongquan Lin, Wenhao Huang, Jiajun Zhang, Chenghua Lin, Jie Fu, Min Yang, Shiwen Ni, Ge Zhang

Topics

Natural Language Processing > Resources & Methods > Large Language Models
Natural Language Processing > Resources & Methods > Multilingual NLP
Machine Learning > Learning Types > Transfer Learning
Artificial Intelligence > Core AI > Large Language Models

Keywords

transfer learning, instruction tuning, instruction fine-tuning, Chinese language, data mixing, large language model
