EMNLP 2022

Does Sentence Segmentation Matter for Machine Translation?

Abstract

For the most part, NLP applications operate at the sentence level. Since sentences occur most naturally in documents, they must be extracted and segmented using a segmenter, of which there are a handful of options. There has been some work evaluating segmenters on intrinsic metrics that measure their ability to recover human-segmented sentence boundaries, but there has been no work looking at the effect of segmenters on downstream tasks. We ask the question, “does segmentation matter?” and attempt to answer it on the task of machine translation. We consider two settings: the application of segmenters to a black-box system whose training segmentation is mostly unknown, and the variation in performance when segmenters are also applied to the training process. We find that the choice of segmenter largely does not matter, so long as its behavior is not one of extreme under- or over-segmentation. For those extreme cases, we provide some qualitative analysis examining their harms, and point the way towards document-level processing.
