2022 IJCNLP IJCNLP 2022

When Cantonese NLP Meets Pre-training: Progress and Challenges

Abstract

AbstractCantonese is an influential Chinese variant with a large population of speakers worldwide. However, it is under-resourced in terms of the data scale and diversity, excluding Cantonese Natural Language Processing (NLP) from the stateof-the-art (SOTA) “pre-training and fine-tuning” paradigm. This tutorial will start with a substantially review of the linguistics and NLP progress for shaping language specificity, resources, and methodologies. It will be followed by an introduction to the trendy transformerbased pre-training methods, which have been largely advancing the SOTA performance of a wide range of downstream NLP tasks in numerous majority languages (e.g., English and Chinese). Based on the above, we will present the main challenges for Cantonese NLP in relation to Cantonese language idiosyncrasies of colloquialism and multilingualism, followed by the future directions to line NLP for Cantonese and other low-resource languages up to the cutting-edge pre-training practice.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Natural Language Processing
🧭 Keyword Pioneer — cantonese nlp
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio