TUDA-CCL at SemEval-2021 Task 1: Using Gradient-boosted Regression Tree Ensembles Trained on a Heterogeneous Feature Set for Predicting Lexical Complexity

Sebastian Gombert; Sabine Bartsch

2021 ACL ACL 2021

TUDA-CCL at SemEval-2021 Task 1: Using Gradient-boosted Regression Tree Ensembles Trained on a Heterogeneous Feature Set for Predicting Lexical Complexity

Abstract

AbstractIn this paper, we present our systems submitted to SemEval-2021 Task 1 on lexical complexity prediction. The aim of this shared task was to create systems able to predict the lexical complexity of word tokens and bigram multiword expressions within a given sentence context, a continuous value indicating the difficulty in understanding a respective utterance. Our approach relies on gradient boosted regression tree ensembles fitted using a heterogeneous feature set combining linguistic features, static and contextualized word embeddings, psycholinguistic norm lexica, WordNet, word- and character bigram frequencies and inclusion in wordlists to create a model able to assign a word or multiword expression a context-dependent complexity score. We can show that especially contextualised string embeddings can help with predicting lexical complexity.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — gradient boosted regression

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Sebastian Gombert , Sabine Bartsch

Topics

Machine Learning > Core Methods > Regression Machine Learning > Optimization & Theory > Optimization Natural Language Processing > Applications > Text Classification Natural Language Processing > Resources & Methods > Text Representation

Keywords

lexical complexity gradient boosting feature engineering word embedding gradient boosted regression tree feature ensemble contextual embedding linguistic feature contextualized embedding contextualized word embedding heterogeneous feature psycholinguistic norm lexical complexity prediction gradient boosted regression heterogeneous feature set regression tree ensemble

Download PDF

Related papers

Out-of-Scope Intent Detection with Self-Supervision and Discriminative Training 2021

A Non-Autoregressive Edit-Based Approach to Controllable Text Simplification 2021

How Did This Get Funded?! Automatically Identifying Quirky Scientific Achievements 2021

Exploring Discourse Structures for Argument Impact Classification 2021

Language Embeddings for Typology and Cross-lingual Transfer Learning 2021