Sentence-Alignment in Semi-parallel Datasets

Steffen Frenzel; Manfred Stede

NAACL 2025

Abstract

In this paper, we test sentence alignment on complex, semi-parallel corpora, i.e., different versions of the same text that have been altered to some extent. We evaluate two hypotheses. First, to make alignment algorithms more efficient, we test whether matching pairs can be found in the immediate vicinity of the source sentence, so that it is sufficient to search for paraphrases within a ‘context window’. Second, to improve alignment quality on complex, semi-parallel texts, we test segmenting the texts into Elementary Discourse Units (EDUs) so that alignments can be made more precisely at this level. Since EDUs are the smallest possible unit for communicating a full proposition, we assume that aligning at this level can improve the overall quality. Both hypotheses are tested and validated with several embedding models on German datasets with varying degrees of parallelism. The advantages and disadvantages of the different approaches are presented, and our next steps are outlined.
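To make the ‘context window’ idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that sentence (or EDU) embeddings are already available as plain lists of floats, that the expected target position can be estimated by linear projection of the source index, and that cosine similarity with a fixed threshold decides whether a pair is accepted. The names `align_in_window`, `window`, and `threshold` are hypothetical and chosen for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_in_window(src_vecs, tgt_vecs, window=2, threshold=0.5):
    """Align each source unit to its best target unit inside a context window.

    Assumption (not from the paper): the window is centred on the source
    index projected linearly onto the target text.
    Returns a list of (source_index, target_index, similarity) tuples.
    """
    n_src, n_tgt = len(src_vecs), len(tgt_vecs)
    alignments = []
    for i, src in enumerate(src_vecs):
        # Estimate where unit i should appear in the target text.
        center = round(i * (n_tgt - 1) / (n_src - 1)) if n_src > 1 else 0
        lo = max(0, center - window)
        hi = min(n_tgt, center + window + 1)

        # Compare only against candidates inside the window.
        best_j, best_score = None, -1.0
        for j in range(lo, hi):
            score = cosine(src, tgt_vecs[j])
            if score > best_score:
                best_j, best_score = j, score

        # Keep the pair only if it is similar enough; otherwise leave unaligned.
        if best_j is not None and best_score >= threshold:
            alignments.append((i, best_j, best_score))
    return alignments


# Toy vectors standing in for real sentence embeddings.
source = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
target = [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 1, 1]]
narrow = align_in_window(source, target, window=1)
wide = align_in_window(source, target, window=2)
```

In this toy example the last source unit has its exact counterpart at target index 1, which lies outside the narrow window, so `narrow` falls back to a weaker match while `wide` recovers the exact one. This illustrates the trade-off the first hypothesis addresses: a smaller window means fewer comparisons per unit, but it only works if matching pairs really do stay close to their original position.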
