Huatuo-26M, a Large-scale Chinese Medical QA Dataset

Xidong Wang; Jianquan Li; Shunian Chen; Yuxuan Zhu; XiangBo Wu; Zhiyi Zhang; Xiaolong Xu; Junying Chen; Jie Fu; Xiang Wan; Anningzhe Gao; Benyou Wang

NAACL 2025

Abstract

Large Language Models infuse newfound vigor into the advancement of the medical domain, yet the scarcity of data poses a significant bottleneck hindering community progress. In this paper, we release the largest ever medical Question Answering (QA) dataset, named Huatuo-26M, with 26 million QA pairs. We benchmark many existing approaches on our dataset in terms of both retrieval and generation. We also experimentally show the benefit of the proposed dataset in many aspects: (i) it serves as fine-tuning data for training medical Large Language Models (LLMs); (ii) it works as an external knowledge source for retrieval-augmented generation (RAG); (iii) it demonstrates transferability by enhancing zero-shot performance on other QA datasets; and (iv) it aids in training biomedical models as a pre-training corpus. Our empirical findings substantiate the dataset’s utility in these domains, thereby confirming its significance as a resource in the medical QA landscape.

Topics

Natural Language Processing > Applications > Question Answering
Healthcare & Medicine > Research > Medical AI
Machine Learning > Learning Types > Transfer Learning
Healthcare & Medicine > Clinical > Medical AI

Keywords

retrieval-augmented generation, zero-shot learning, knowledge distillation, zero-shot transfer, medical domain, medical question answering, chinese medical nlp, medical llm, large language model
