Building a Part-of-Speech Tagged Corpus for Drenjongke (Bhutia)

Mana Ashida; Seunghun Lee; Kunzang Namgyal

AACL 2020

Abstract

This paper reports on the creation of the first Drenjongke corpus, based on texts taken from a phrase book for beginners written in the Tibetan script. The text was scanned using optical character recognition (OCR), and a corpus of sentences was created after correcting the errors in the OCR output. A total of 34 Part-of-Speech (PoS) tags were defined based on manual annotation performed by the three authors, one of whom is a native speaker of Drenjongke. This first corpus of the Drenjongke language comprises 275 sentences and 1379 tokens, which we plan to expand with other materials to promote further studies of this language.
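
As a rough illustration of how the counts reported above (sentences, tokens, and size of the tag inventory) could be computed from a tagged corpus, the following is a minimal sketch. It assumes a hypothetical storage format of one sentence per line with `token/TAG` items separated by spaces; the abstract does not specify the corpus format, and the placeholder tokens and tag names below are illustrative only, not taken from the Drenjongke data or its 34-tag inventory.

```python
from collections import Counter

# Hypothetical format (an assumption, not the authors' format):
# one sentence per line, each item written as "token/TAG".
# Tokens w1..w7 and the tag names are placeholders, not corpus data.
SAMPLE = """\
w1/NOUN w2/VERB w3/PUNCT
w4/PRON w5/NOUN w6/VERB w7/PUNCT
"""


def parse_corpus(text):
    """Parse the text into a list of sentences, each a list of (token, tag) pairs."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between sentences
        sentence = []
        for item in line.split():
            # Split on the last "/" so tokens that contain "/" stay intact.
            token, _, tag = item.rpartition("/")
            sentence.append((token, tag))
        sentences.append(sentence)
    return sentences


def corpus_stats(sentences):
    """Count sentences, tokens, and distinct tags in a parsed corpus."""
    tag_counts = Counter(tag for sentence in sentences for _, tag in sentence)
    return {
        "sentences": len(sentences),
        "tokens": sum(tag_counts.values()),
        "tags": len(tag_counts),
        "tag_counts": tag_counts,
    }


stats = corpus_stats(parse_corpus(SAMPLE))
print(stats["sentences"], stats["tokens"], stats["tags"])
```

Run on the actual corpus in this assumed format, the same function would be expected to report the 275 sentences, 1379 tokens, and 34 tags stated in the abstract.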

Topics

Natural Language Processing > Applications; Interdisciplinary > Linguistics > Morphology

Keywords

part-of-speech tagging; endangered language; optical character recognition; corpus annotation; language documentation
