Faithful Persona-based Conversational Dataset Generation with Large Language Models

Pegah Jandaghi; Xianghai Sheng; Xinyi Bai; Jay Pujara; Hakim Sidahmed

2024 ACL ACL 2024

Faithful Persona-based Conversational Dataset Generation with Large Language Models

Abstract

AbstractHigh-quality conversational datasets are essential for developing AI models that can communicate with users.One way to foster deeper interactions between a chatbot and its user is through *personas*, aspects of the user’s character that provide insights into their personality, motivations, and behaviors.Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user, and maintain their engagement. In this paper, we leverage the power of Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic architecture framework to expand the initial dataset, while improving the quality of its conversations.The Generator is an LLM prompted to output conversations.The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations.These experts select the best generated conversations, which we then use to improve the Generator.We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat.We evaluate the quality of Synthetic-Persona-Chat and our generation framework on different dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat during an AI detection test decreases from 17.2% to 8.8% over three iterations.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — generator-critic framework

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Pegah Jandaghi , Xianghai Sheng , Xinyi Bai , Jay Pujara , Hakim Sidahmed

Topics

Natural Language Processing > Generation > Dialogue Systems Natural Language Processing > Resources & Methods > Large Language Models Deep Learning > Models > Large Language Models Machine Learning > Learning Types > Generative Models Artificial Intelligence > Core AI > Dialogue Systems

Keywords

synthetic datum generator-critic framework dataset generation conversational dataset llm prompting large language model synthetic dataset generation persona-based dialogue persona-based conversation

Download PDF

Related papers

Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs 2024

EtymoLink: A Structured English Etymology Dataset 2024

Turkish Delights: A Dataset on Turkish Euphemisms 2024

Subjectivity Detection in English News using Large Language Models 2024

Does DetectGPT Fully Utilize Perturbation? Bridging Selective Perturbation to Fine-tuned Contrastive Learning Detector would be Better 2024