Segmentation-Free Word Embedding for Unsegmented Languages

Takamasa Oshikiri

2017 EMNLP EMNLP 2017

Segmentation-Free Word Embedding for Unsegmented Languages

Abstract

AbstractIn this paper, we propose a new pipeline of word embedding for unsegmented languages, called segmentation-free word embedding, which does not require word segmentation as a preprocessing step. Unlike space-delimited languages, unsegmented languages, such as Chinese and Japanese, require word segmentation as a preprocessing step. However, word segmentation, that often requires manually annotated resources, is difficult and expensive, and unavoidable errors in word segmentation affect downstream tasks. To avoid these problems in learning word vectors of unsegmented languages, we consider word co-occurrence statistics over all possible candidates of segmentations based on frequent character n-grams instead of segmented sentences provided by conventional word segmenters. Our experiments of noun category prediction tasks on raw Twitter, Weibo, and Wikipedia corpora show that the proposed method outperforms the conventional approaches that require word segmenters.

🌉 Interdisciplinary Bridge — Deep Learning and Interdisciplinary and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — unsegmented language

🐣 Hot Topic Early Bird — word segmentation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Takamasa Oshikiri

Topics

Natural Language Processing > Resources & Methods > Text Representation Interdisciplinary > Linguistics > Computational Linguistics Machine Learning > Learning Types > Representation Learning Deep Learning > Learning Types > Self-Supervised Learning Deep Learning > Learning Types > Representation Learning

Keywords

text representation word segmentation co-occurrence statistics word embedding character n-gram chinese language unsegmented language

Download PDF

Related papers

Reinforced Video Captioning with Entailment Rewards 2017

Cross-lingual Character-Level Neural Morphological Tagging 2017

Inter-Weighted Alignment Network for Sentence Pair Modeling 2017

Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings 2017

An Empirical Analysis of Edit Importance between Document Versions 2017