Want To Reduce Labeling Cost? GPT-3 Can Help

Shuohang Wang; Yang Liu; Yichong Xu; Chenguang Zhu; Michael Zeng

EMNLP 2021

Abstract

Data annotation is a time-consuming and labor-intensive process for many NLP tasks. Although there exist various methods to produce pseudo data labels, they are often task-specific and require a decent amount of labeled data to start with. Recently, the immense language model GPT-3 with 175 billion parameters has achieved tremendous improvement across many few-shot learning tasks. In this paper, we explore ways to leverage GPT-3 as a low-cost data labeler to train other models. We find that to make the downstream model achieve the same performance on a variety of NLU and NLG tasks, it costs 50% to 96% less to use labels from GPT-3 than to use labels from humans. Furthermore, we propose a novel framework of combining pseudo labels from GPT-3 with human labels, which leads to even better performance. These results present a cost-effective data labeling methodology that is generalizable to many practical applications.

Topics

Artificial Intelligence > Learning Paradigms > Few-Shot Learning; Machine Learning > Learning Types > Weakly Supervised Learning; Machine Learning > Learning Paradigms > Few-Shot Learning; Machine Learning > Learning Types > Few-Shot Learning; Deep Learning > Models > Large Language Models

Keywords

few-shot learning; natural language processing; data annotation; pseudo labeling; data labeling; large language model; labeling cost
