Towards L2-friendly pipelines for learner corpora: A case of written production by L2-Korean learners

Hakyung Sung; Gyu-Ho Shin

ACL 2023

Abstract

We introduce the Korean-Learner-Morpheme (KLM) corpus, a manually annotated dataset consisting of 129,784 morphemes from second language (L2) learners of Korean, featuring morpheme tokenization and part-of-speech (POS) tagging. We evaluate the performance of four Korean morphological analyzers in tokenization and POS tagging on the L2-Korean corpus. Results highlight the analyzers’ reduced performance on L2 data, indicating the limitations of advanced deep-learning models when dealing with L2-Korean corpora. We further show that fine-tuning one of the models with the KLM corpus improves its tokenization and POS tagging accuracy on L2-Korean data.
