EMNLP 2020
fugashi, a Tool for Tokenizing Japanese in Python
Abstract
Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high-quality open source tokenizers exist, they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.
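As a rough illustration of what tokenizing Japanese with a MeCab wrapper looks like in practice, the sketch below splits a sentence into words with fugashi. It assumes fugashi and a UniDic dictionary are installed (for example via `pip install fugashi unidic-lite`; the dictionary choice is an assumption, not something the abstract specifies). The helper `surfaces` and the sample sentence are hypothetical names chosen for this example, and the `lemma` and `pos` fields are only available with a UniDic-format dictionary.

```python
# Minimal sketch: word-level tokenization of Japanese with fugashi.
# Assumption: fugashi plus a UniDic dictionary (e.g. unidic-lite) is installed.


def surfaces(text, tagger):
    """Return the surface form of every token that `tagger` yields for `text`.

    `tagger` is any callable that maps a string to a sequence of objects
    with a `.surface` attribute; a fugashi Tagger satisfies this.
    """
    return [word.surface for word in tagger(text)]


def main():
    try:
        # Imported lazily so the helper above stays usable without fugashi.
        from fugashi import Tagger

        tagger = Tagger()  # raises RuntimeError if no usable dictionary is found
    except (ImportError, RuntimeError) as err:
        print(f"fugashi is not ready ({err}); try: pip install fugashi unidic-lite")
        return

    # Hypothetical sample sentence; there are no spaces between the words.
    text = "麩菓子は、麩を主材料とした日本の菓子。"

    # Word boundaries recovered by the tokenizer.
    print(surfaces(text, tagger))

    # Per-token details. `feature.lemma` and `pos` assume a UniDic dictionary.
    for word in tagger(text):
        print(word.surface, word.feature.lemma, word.pos, sep="\t")


if __name__ == "__main__":
    main()
```

Keeping the tagger as a parameter of `surfaces` means one `Tagger` instance can be created once and reused across many sentences, which avoids reloading the dictionary on every call.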
Topics
Natural Language Processing > Understanding > Syntax
Natural Language Processing > Resources & Methods > Multilingual NLP
Natural Language Processing > Resources & Methods > Text Representation
Interdisciplinary > Linguistics
Interdisciplinary > Linguistics > Computational Linguistics
Natural Language Processing > Applications > Text Processing