Aalamaram: A Large-Scale Linguistically Annotated Treebank for the Tamil Language

A M Abirami; Wei Qi Leong; Hamsawardhini Rengarajan; D Anitha; R Suganya; Himanshu Singh; Kengatharaiyer Sarveswaran; William Chandra Tjhi; Rajiv Ratn Shah

COLING 2024

Abstract

Tamil is a relatively low-resource language in the field of Natural Language Processing (NLP). Recent years have seen growth in Tamil NLP datasets for Natural Language Understanding (NLU) or Natural Language Generation (NLG) tasks, but high-quality linguistic resources remain scarce. To help close this gap, this paper introduces Aalamaram, a treebank with rich linguistic annotations for the Tamil language. It is the largest publicly available Tamil treebank to date, with almost 10,000 sentences from diverse sources, and is annotated for Part-of-Speech (POS) tagging, Named Entity Recognition (NER), morphological parsing, and dependency parsing. Close attention has also been paid to multi-word segmentation, especially in the context of Tamil clitics. Although the treebank is based largely on the Universal Dependencies (UD) specifications, significant effort has been made to adapt the annotation rules to the idiosyncrasies and complexities of the Tamil language, thereby providing a valuable resource for linguistic research and NLP development.
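To make the multi-word segmentation point concrete, here is a minimal Python sketch of how a clitic-bearing surface token could be related to its syntactic words in CoNLL-U, the standard UD file format. It assumes the treebank is distributed as CoNLL-U, which the abstract does not state. The sample sentence is a hypothetical transliterated example with illustrative tags, not data from Aalamaram, and the function names `parse_sentence` and `surface_tokens` are invented for this illustration.

```python
# Hypothetical CoNLL-U fragment (transliterated, illustrative annotations only;
# NOT taken from Aalamaram). The range line "1-2" marks one surface token that
# is split into two syntactic words: a pronoun and a clitic.
SAMPLE = "\n".join([
    "# text = avanum vandhaan",
    "1-2\tavanum\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tavan\tavan\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "2\tum\tum\tPART\t_\t_\t1\tadvmod:emph\t_\t_",
    "3\tvandhaan\tvaa\tVERB\t_\t_\t0\troot\t_\t_",
])


def parse_sentence(block):
    """Parse one CoNLL-U sentence into syntactic words and multiword spans."""
    words = []   # syntactic words, in order
    spans = {}   # first word id -> (last word id, surface form)
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 tab-separated columns: %r" % line)
        tok_id = cols[0]
        if "-" in tok_id:
            # Multiword token line: only ID and FORM carry information.
            start, end = (int(part) for part in tok_id.split("-"))
            spans[start] = (end, cols[1])
        elif "." in tok_id:
            continue  # empty nodes (enhanced dependencies) are ignored here
        else:
            words.append({
                "id": int(tok_id),
                "form": cols[1],
                "lemma": cols[2],
                "upos": cols[3],
                "head": int(cols[6]),
                "deprel": cols[7],
            })
    return words, spans


def surface_tokens(words, spans):
    """Pair each surface token with the syntactic words it contains."""
    result = []
    i = 0
    while i < len(words):
        word_id = words[i]["id"]
        if word_id in spans:
            end, form = spans[word_id]
            count = end - word_id + 1
            result.append((form, [w["form"] for w in words[i:i + count]]))
            i += count
        else:
            result.append((words[i]["form"], [words[i]["form"]]))
            i += 1
    return result


words, spans = parse_sentence(SAMPLE)
for form, parts in surface_tokens(words, spans):
    print(form, "->", parts)
```

Keeping the surface form on a separate range line lets POS tags, morphological features, and dependency relations attach to the individual syntactic words while the original text stays recoverable.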

Topics

Natural Language Processing > Understanding > Named Entity Recognition; Natural Language Processing > Understanding > Part-of-Speech Tagging; Interdisciplinary > Linguistics > Computational Linguistics

Keywords

named entity recognition; dependency parsing; computational linguistics; part-of-speech tagging; morphological parsing
