The Korean Morphologically Tight-Fitting Tokenizer for Noisy User-Generated Texts

Sangah Lee; Hyopil Shin

EMNLP 2021

Abstract

User-generated texts exhibit various stylistic properties, or noise. Such texts are not properly processed by existing morpheme analyzers or language models built on formal texts such as encyclopedias or news articles. In this paper, we propose a simple morphologically tight-fitting tokenizer (K-MT) that can better process proper nouns, coinages, and internet slang, among other types of noise in Korean user-generated texts. We tested our tokenizer on classification tasks using Korean user-generated movie review and hate speech datasets, and on the Korean Named Entity Recognition dataset. In our tests, K-MT handled internet slang, proper nouns, and coinages better than a morpheme analyzer and a character-level WordPiece tokenizer.

Topics

Machine Learning > Application Areas > Domain Adaptation
Natural Language Processing > Understanding > Named Entity Recognition
Natural Language Processing > Applications > Text Classification
Natural Language Processing > Resources & Methods > Text Representation
Interdisciplinary > Linguistics > Morphology
Machine Learning > Core Methods > Feature Learning

Keywords

text classification, named entity recognition, morphological analysis, hate speech detection, user-generated content, Korean language processing
