Investigating the Utility of Synthetic Data for Doctor-Patient Conversation Summarization

Siyuan Chen; Colin A. Grambow; Mojtaba Kadkhodaie Elyaderani; Alireza Sadeghi; Federico Fancellu; Thomas Schaaf

2023 INTERSPEECH INTERSPEECH 2023

Investigating the Utility of Synthetic Data for Doctor-Patient Conversation Summarization

Abstract

Large-scale pre-training has been a successful strategy for training transformer models. However, maintaining a large clinical dataset for pre-training is not always possible, and access to data in this domain can be time-limited and costly. We explore using synthetic data in pre-training sequence-to-sequence (seq-to-seq) transformer models to generate clinical notes from Doctor-Patient-Conversations (DoPaCos). Using a generative language model fine-tuned on authentic conversations, a synthetic DoPaCo dataset was created and used with a corpus of clinical notes to pre-train a Longformer-Encoder-Decoder (LED) model. Results show that synthetic data leads to comparable performance in the downstream summarization task compared to pre-training with authentic data. Pre-training on synthetic conversations first, followed by clinical notes, yields higher performance across most of our evaluation metrics.

🌉 Interdisciplinary Bridge — Deep Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio