SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization

Bogdan Gliwa; Iwona Mochol; Maciej Biesek; Aleksander Wawer

EMNLP 2019

Abstract

This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than the model-generated summaries of news – in contrast with human evaluators’ judgement. This suggests that a challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality chat-dialogues corpus, manually annotated with abstractive summarizations, which can be used by the research community for further studies.
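The abstract contrasts ROUGE scores with human judgement. As a rough illustration of how a lexical-overlap metric can diverge from meaning, the following is a minimal sketch of ROUGE-1 F1 (unigram overlap) under simplifying assumptions: whitespace tokenization, lowercasing, no stemming. The function name `rouge1_f1` and the two example sentences are hypothetical and are not taken from the paper or the corpus, and this is not the authors' evaluation code.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Simplified sketch: whitespace tokens, lowercased, no stemming.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical, made-up summaries: same words, but the roles are swapped.
reference = "anna will bring tom the book tomorrow"
candidate = "tom will bring anna the book tomorrow"
score = rouge1_f1(reference, candidate)
print(score)
```

Because the two made-up sentences contain exactly the same unigrams, the sketch scores them as a perfect match even though they say different things about who brings the book. Errors of this kind are one plausible reason overlap-based scores and human ratings can disagree.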

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Application Areas > Domain Adaptation
Natural Language Processing > Applications > Dialogue Systems
Natural Language Processing > Applications > Summarization

Keywords

natural language processing; text summarization; human annotation; dialogue summarization; abstractive summarization; dialogue dataset
