PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts

Franck Dernoncourt; Ji Young Lee

2017 IJCNLP IJCNLP 2017

PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts

Abstract

AbstractWe present PubMed 200k RCT, a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with their role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more efficiently, especially in fields where abstracts may be long, such as the medical field.

🌉 Interdisciplinary Bridge — Healthcare & Medicine and Natural Language Processing

🧭 Keyword Pioneer — randomized controlled trial

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Franck Dernoncourt , Ji Young Lee

Topics

Natural Language Processing > Applications > Text Classification Healthcare & Medicine > Clinical > Clinical NLP

Keywords

natural language processing text classification medical text randomized controlled trial sequential sentence classification scientific abstract medical text classification sentence labeling

Download PDF

Related papers

Procedural Text Generation from an Execution Video 2017

DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset 2017

Roles and Success in Wikipedia Talk Pages: Identifying Latent Patterns of Behavior 2017

Alibaba at IJCNLP-2017 Task 1: Embedding Grammatical Features into LSTMs for Chinese Grammatical Error Diagnosis Task 2017

NCTU-NTUT at IJCNLP-2017 Task 2: Deep Phrase Embedding using bi-LSTMs for Valence-Arousal Ratings Prediction of Chinese Phrases 2017