To Split or Not to Split: Composing Compounds in Contextual Vector Spaces

Chris Jenkins; Filip Miletić; Sabine Schulte im Walde

EMNLP 2023

Abstract

We investigate the effect of sub-word tokenization on representations of German noun compounds: single orthographic words which are composed of two or more constituents but often tokenized into units that are not morphologically motivated or meaningful. Using variants of BERT models and tokenization strategies on domain-specific restricted diachronic data, we introduce a suite of evaluations relying on the masked language modelling task and compositionality prediction. We obtain the most consistent improvements by pre-splitting compounds into constituents.
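The contrast the abstract describes can be illustrated with a minimal sketch. It is not the authors' pipeline: the greedy longest-match tokenizer below only imitates WordPiece-style behaviour, and the vocabulary `VOCAB`, the split lexicon `SPLITS`, and the function names are all hypothetical, chosen so that the compound "apfelbaum" (apple tree) breaks at a non-morphological boundary unless it is pre-split.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first sub-word tokenization (WordPiece-style toy)."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # continuation pieces carry the "##" prefix
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(word, vocab, splits=None):
    """Tokenize a word, optionally pre-splitting it into constituents first."""
    constituents = splits.get(word, [word]) if splits else [word]
    return [p for c in constituents for p in wordpiece(c, vocab)]


# Hypothetical toy vocabulary and compound lexicon (assumptions, not from the paper).
VOCAB = {"apfel", "apfelb", "baum", "##aum"}
SPLITS = {"apfelbaum": ["apfel", "baum"]}

print(tokenize("apfelbaum", VOCAB))          # ['apfelb', '##aum']
print(tokenize("apfelbaum", VOCAB, SPLITS))  # ['apfel', 'baum']
```

Without pre-splitting, the greedy match consumes "apfelb" and leaves "##aum", neither of which corresponds to a constituent. With pre-splitting, each constituent is tokenized on its own, so the pieces line up with "apfel" and "baum". How the splits are obtained in practice (for example, from a compound splitter or a gold-standard resource) is left open here.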

Topics

Natural Language Processing > Resources & Methods > Lexical Semantics
Natural Language Processing > Resources & Methods > Text Representation

Keywords

BERT model, German language, morphological analysis, compositionality prediction, compound word, sub-word tokenization
