Understanding Task Design Trade-offs in Crowdsourced Paraphrase Collection

Youxuan Jiang; Jonathan K. Kummerfeld; Walter S. Lasecki

ACL 2017

Abstract

Linguistically diverse datasets are critical for training and evaluating robust machine learning systems, but data collection is a costly process that often requires experts. Crowdsourcing the process of paraphrase generation is an effective means of expanding natural language datasets, but there has been limited analysis of the trade-offs that arise when designing tasks. In this paper, we present the first systematic study of the key factors in crowdsourcing paraphrase collection. We consider variations in instructions, incentives, data domains, and workflows. We manually analyzed paraphrases for correctness, grammaticality, and linguistic diversity. Our observations provide new insight into the trade-offs between accuracy and diversity in crowd responses that arise as a result of task design, providing guidance for future paraphrase generation procedures.

Authors

Youxuan Jiang, Jonathan K. Kummerfeld, Walter S. Lasecki

Topics

Data Science & Analytics > Methods > Data Mining

Keywords

paraphrase generation, annotation quality, linguistic diversity, data collection, task design
