Training Hybrid Language Models by Marginalizing over Segmentations

Edouard Grave; Sainbayar Sukhbaatar; Piotr Bojanowski; Armand Joulin

ACL 2019

Abstract

In this paper, we study the problem of hybrid language modeling, that is, using models that can predict both characters and larger units such as character n-grams or words. With such models, a given string usually admits multiple segmentations, for example one using words and one using characters only. The probability of a string is therefore the sum of the probabilities of all its possible segmentations. Here, we show how to marginalize over the segmentations efficiently in order to compute the true probability of a sequence. We apply our technique to three datasets, comprising seven languages, and show improvements over a strong character-level language model.
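The marginalization described in the abstract can be illustrated with a small dynamic program. The sketch below is not the authors' implementation: it assumes a toy, context-independent table of unit probabilities (the hypothetical `UNIT_PROB`), whereas the paper's model is a neural language model. The names `log_marginal` and `brute_force_marginal` are likewise illustrative. It only shows why the sum over all segmentations can be computed in time linear in the string length (times the maximum unit length) rather than by enumerating every segmentation.

```python
import math

# Hypothetical toy vocabulary mixing characters and longer units.
# The probabilities are made up for illustration and ignore context.
UNIT_PROB = {"a": 0.2, "b": 0.3, "ab": 0.1, "ba": 0.4}


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_marginal(text, unit_prob, max_len):
    """Log-probability of `text`, summed over all segmentations.

    alpha[i] holds the log of the total probability of every
    segmentation of the prefix text[:i].
    """
    alpha = [-math.inf] * (len(text) + 1)
    alpha[0] = 0.0  # the empty prefix has probability 1
    for i in range(1, len(text) + 1):
        scores = []
        # Consider every unit text[j:i] that could end at position i.
        for j in range(max(0, i - max_len), i):
            unit = text[j:i]
            if unit in unit_prob:
                scores.append(alpha[j] + math.log(unit_prob[unit]))
        if scores:
            alpha[i] = log_sum_exp(scores)
    return alpha[len(text)]


def brute_force_marginal(text, unit_prob):
    """Same quantity by explicit enumeration (exponential time)."""
    if not text:
        return 1.0
    total = 0.0
    for k in range(1, len(text) + 1):
        head = text[:k]
        if head in unit_prob:
            total += unit_prob[head] * brute_force_marginal(text[k:], unit_prob)
    return total


if __name__ == "__main__":
    fast = math.exp(log_marginal("aba", UNIT_PROB, max_len=2))
    slow = brute_force_marginal("aba", UNIT_PROB)
    print(fast, slow)
```

For the string "aba", the toy table admits three segmentations, a|b|a, ab|a, and a|ba, with probabilities 0.012, 0.02, and 0.08. Both functions return their sum, but the dynamic program visits each position once instead of enumerating segmentations.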

Topics

Natural Language Processing > Generation > Language Modeling
Deep Learning > Learning Types > Sequence Modeling

Keywords

word segmentation, language model, hybrid language model, character level model, sequence probability, character modeling
