One of these words is not like the other: a reproduction of outlier identification using non-contextual word representations

Jesper Brink Andersen; Mikkel Bak Bertelsen; Mikkel Hørby Schou; Manuel R. Ciosici; Ira Assent

2020 EMNLP EMNLP 2020

One of these words is not like the other: a reproduction of outlier identification using non-contextual word representations

Abstract

AbstractWord embeddings are an active topic in the NLP research community. State-of-the-art neural models achieve high performance on downstream tasks, albeit at the cost of computationally expensive training. Cost aware solutions require cheaper models that still achieve good performance. We present several reproduction studies of intrinsic evaluation tasks that evaluate non-contextual word representations in multiple languages. Furthermore, we present 50-8-8, a new data set for the outlier identification task, which avoids limitations of the original data set, such as ambiguous words, infrequent words, and multi-word tokens, while increasing the number of test cases. The data set is expanded to contain semantic and syntactic tests and is multilingual (English, German, and Italian). We provide an in-depth analysis of word embedding models with a range of hyper-parameters. Our analysis shows the suitability of different models and hyper-parameters for different tasks and the greater difficulty of representing German and Italian languages.

🌉 Interdisciplinary Bridge — Interdisciplinary and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — outlier identification

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jesper Brink Andersen , Mikkel Bak Bertelsen , Mikkel Hørby Schou , Manuel R. Ciosici , Ira Assent

Topics

Machine Learning > Core Methods > Representation Learning Machine Learning > Core Methods > Embedding Learning Natural Language Processing > Resources & Methods > Multilingual NLP Natural Language Processing > Resources & Methods > Text Representation Interdisciplinary > Linguistics > Computational Linguistics

Keywords

multilingual nlp semantic similarity word embedding intrinsic evaluation outlier identification non-contextual representation

Download PDF

Related papers

Fast semantic parsing with well-typedness guarantees 2020

Detecting Objectifying Language in Online Professor Reviews 2020

Analogous Process Structure Induction for Sub-event Sequence Prediction 2020

Aspect Sentiment Classification with Aspect-Specific Opinion Spans 2020

Robust and Interpretable Grounding of Spatial References with Relation Networks 2020