Building a Japanese Typo Dataset from Wikipedia’s Revision History

Yu Tanaka; Yugo Murawaki; Daisuke Kawahara; Sadao Kurohashi

ACL 2020

Abstract

User-generated texts contain many typos that must be corrected for NLP systems to work. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia’s revision history. Unlike other languages, Japanese poses unique challenges: (1) Japanese texts are unsegmented, so we cannot simply apply a spelling checker, and (2) the way people input kanji logographs results in typos whose surface forms differ drastically from the correct ones. We address these challenges by combining character-based extraction rules, morphological analyzers to guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.
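As a rough illustration of what a character-based extraction rule over revision pairs might look like, the following Python sketch compares a sentence before and after an edit and keeps the pair only when the two differ in a single short contiguous character span. This is not the authors' actual rule set, which the abstract does not spell out: the function name `extract_typo_pair` and the `max_changed` threshold are hypothetical, and the reading-based and filtering steps are omitted entirely.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def extract_typo_pair(
    before: str, after: str, max_changed: int = 3
) -> Optional[Tuple[str, str]]:
    """Return (typo_span, corrected_span) if the two sentences differ in
    exactly one short contiguous character span; otherwise return None.

    Working on characters avoids any need for word segmentation.
    max_changed is an assumed threshold, not a value from the paper.
    """
    if before == after:
        return None
    # autojunk=False keeps frequent characters from being ignored.
    matcher = SequenceMatcher(None, before, after, autojunk=False)
    edits = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    # Keep only revisions with a single edited span.
    if len(edits) != 1:
        return None
    _, i1, i2, j1, j2 = edits[0]
    # Reject long rewrites, which are unlikely to be typo fixes.
    if max(i2 - i1, j2 - j1) > max_changed:
        return None
    return before[i1:i2], after[j1:j2]


if __name__ == "__main__":
    # A one-character substitution is kept as a candidate pair.
    print(extract_typo_pair("これはテストてす", "これはテストです"))
```

A rule this simple would miss kanji conversion errors, where the typo and its correction share a reading but few or no characters; that is the case for which the abstract says morphological analyzers are used to guess readings.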
