Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs

Young-Suk Lee; Md Sultan; Yousef El-Kurdi; Tahira Naseem; Asim Munawar; Radu Florian; Salim Roukos; Ramón Astudillo

EMNLP 2023

Abstract

Using in-context learning (ICL) for data generation, techniques such as Self-Instruct (Wang et al., 2023) or the follow-up Alpaca (Taori et al., 2023) can train strong conversational agents with only a small amount of human supervision. One limitation of these approaches is that they rely on very large language models (around 175B parameters) that are also proprietary and non-public. Here we explore the application of such techniques to language models that are much smaller (around 10B–40B parameters) and have permissive licenses. We find the Self-Instruct approach to be less effective at these sizes and propose new ICL methods that draw on two main ideas: (a) categorization and simplification of the ICL templates to make prompt learning easier for the LM, and (b) ensembling over multiple LM outputs to help select high-quality synthetic examples. Our algorithm leverages the 175 Self-Instruct seed tasks and employs separate pipelines for instructions that require an input and instructions that do not. Empirical investigations with different LMs show that: (1) our proposed method yields higher-quality instruction tuning data than Self-Instruct, (2) it improves the performance of both vanilla and instruction-tuned LMs by significant margins, and (3) smaller instruction-tuned LMs generate more useful examples than their larger un-tuned counterparts.
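To make idea (b) concrete, the following is a minimal sketch of selecting one output by consensus among several LM outputs. The abstract does not specify the selection rule, so everything here is an assumption for illustration: the function names (`lcs_length`, `overlap_f1`, `select_by_consensus`), the longest-common-subsequence overlap score, and the 0.5 threshold are all hypothetical and not taken from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_f1(a, b):
    """Token-level F1 based on LCS overlap (an assumed similarity measure)."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_length(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(tb), lcs / len(ta)
    return 2 * precision * recall / (precision + recall)


def select_by_consensus(candidates, threshold=0.5):
    """Return the candidate that agrees most with the others.

    `candidates` holds one output per LM for the same instruction.
    Returns None when even the best candidate's mean similarity to the
    other outputs falls below `threshold` (hypothetical cutoff), so the
    synthetic example would be discarded.
    """
    if len(candidates) < 2:
        return None
    best, best_score = None, -1.0
    for i, cand in enumerate(candidates):
        scores = [overlap_f1(cand, other)
                  for j, other in enumerate(candidates) if j != i]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best, best_score = cand, mean_score
    return best if best_score >= threshold else None


# Two of three hypothetical LM outputs agree, so the agreed output is kept.
outputs = ["the cat sat", "the cat sat", "dogs bark loudly"]
print(select_by_consensus(outputs))
```

Under these assumptions, an example is kept only when the LMs broadly agree on an output; when every output differs, `select_by_consensus` returns `None` and the example is dropped.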

Topics

Deep Learning > Architectures > Transformers
Natural Language Processing > Resources & Methods > Large Language Models
Machine Learning > Learning Types > Transfer Learning
Artificial Intelligence > Core AI > Large Language Models
Machine Learning > Learning Types > In-Context Learning
Machine Learning > Core Methods > Ensemble Methods
Deep Learning > Learning Types > Fine-Tuning
Deep Learning > Learning Types > In-Context Learning

Keywords

in-context learning, synthetic data generation, instruction tuning, ensemble method, language model, synthetic data, data generation
