Synthesizing Text-to-SQL Data from Weak and Strong LLMs

Jiaxi Yang; Binyuan Hui; Min Yang; Jian Yang; Junyang Lin; Chang Zhou

ACL 2024

Abstract

The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks. In this paper, we introduce a synthetic data approach that combines data produced by larger, more powerful models (strong models) with error information generated by smaller, less well-aligned models (weak models). The method not only enhances the domain generalization of text-to-SQL models but also explores the potential of supervision from error data through preference learning. Furthermore, we apply the synthetic data approach to instruction tuning of open-source LLMs, resulting in SENSE, a specialized text-to-SQL model. The effectiveness of SENSE is demonstrated through state-of-the-art results on the SPIDER and BIRD benchmarks, bridging the performance gap between open-source models and methods that prompt closed-source models.
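
As a rough illustration of how error data from a weak model could feed preference learning, the sketch below labels candidate SQL queries by executing them and comparing their results with a reference query, then pairs each failing query with a passing one. This is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not say how errors are detected, so the execution-based check, the helper names `run_query` and `build_preference_pairs`, the toy `singer` table, and the `prompt`/`chosen`/`rejected` field names are all hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def build_preference_pairs(conn, question, reference_sql, weak_candidates):
    """Pair each failing weak-model query (rejected) with a passing one (chosen).

    Assumption: a candidate counts as correct when its result rows match
    those of the reference query, ignoring row order.
    """
    expected = run_query(conn, reference_sql)
    if expected is None:
        raise ValueError("reference SQL does not execute")
    correct, wrong = [], []
    for sql in weak_candidates:
        (correct if run_query(conn, sql) == expected else wrong).append(sql)
    # Fall back to the reference query when no candidate is correct.
    chosen = correct[0] if correct else reference_sql
    return [{"prompt": question, "chosen": chosen, "rejected": bad} for bad in wrong]


# Toy database and hand-written candidates standing in for weak-model samples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bo', 41), (3, 'Cy', 25);
""")
pairs = build_preference_pairs(
    conn,
    question="How many singers are older than 28?",
    reference_sql="SELECT COUNT(*) FROM singer WHERE age > 28",
    weak_candidates=[
        "SELECT COUNT(*) FROM singer WHERE age > 28",   # matches the reference
        "SELECT COUNT(*) FROM singer WHERE age < 28",   # wrong filter
        "SELECT COUNT(*) FROM singers WHERE age > 28",  # wrong table name
    ],
)
for pair in pairs:
    print(pair["rejected"])
```

Records of this shape could then be passed to a preference-optimization trainer; which objective and trainer to use is outside what the abstract specifies.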

Topics

Natural Language Processing > Applications > Information Extraction
Natural Language Processing > Applications > Machine Translation
Natural Language Processing > Resources & Methods > Large Language Models
Machine Learning > Learning Types > Transfer Learning
Machine Learning > Learning Types > Preference Learning

Keywords

domain generalization, preference learning, instruction tuning, language model, synthetic data, text-to-sql synthesis, large language model
