Inducing a lexicon of sociolinguistic variables from code-mixed text

Philippa Shoemark; James Kirby; Sharon Goldwater

EMNLP 2018

Abstract

Sociolinguistics is often concerned with how variants of a linguistic item (e.g., nothing vs. nothin’) are used by different groups or in different situations. We introduce the task of inducing lexical variables from code-mixed text: that is, identifying equivalence pairs such as (football, fitba) along with their linguistic code (football→British, fitba→Scottish). We adapt a framework for identifying gender-biased word pairs to this new task, and present results on three different pairs of English dialects, using tweets as the code-mixed text. Our system achieves precision of over 70% for two of these three datasets, and produces useful results even without extensive parameter tuning. Our success in adapting this framework from gender to language variety suggests that it could be used to discover other types of analogous pairs as well.
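The abstract does not spell out how pairs are scored, so the following is only a minimal sketch of one plausible reading. It assumes word embeddings trained on the code-mixed text, and ranks candidate pairs by how closely their difference vector aligns with the difference vector of a known seed pair, in the style of analogy-based pair discovery. The function name `score_pairs`, the `max_dist` threshold, the seed pair (from, fae), and the toy three-dimensional vectors are all hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
import math
from itertools import permutations

def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in u))

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    nu, nv = norm(u), norm(v)
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def score_pairs(emb, seed, max_dist=1.0):
    """Rank ordered word pairs (x, y) by how well emb[x] - emb[y]
    aligns with the seed direction emb[seed[0]] - emb[seed[1]].

    Pairs further apart than max_dist are skipped, on the assumption
    that words far apart in the space are not variants of one item.
    """
    direction = sub(emb[seed[0]], emb[seed[1]])
    scored = []
    for x, y in permutations(emb, 2):
        if {x, y} == set(seed):
            continue  # do not re-discover the seed pair itself
        diff = sub(emb[x], emb[y])
        if norm(diff) > max_dist:
            continue
        scored.append((cosine(diff, direction), x, y))
    return sorted(scored, reverse=True)

# Hypothetical toy embeddings: the first two dimensions loosely encode
# meaning, the third loosely encodes language variety.
EMB = {
    "football": [1.0, 0.0, 0.3],
    "fitba":    [1.0, 0.0, -0.3],
    "from":     [0.0, 1.0, 0.3],
    "fae":      [0.0, 1.0, -0.3],
    "goal":     [0.9, 0.1, 0.3],
}

# The seed pair is ordered (first variety, second variety).
ranked = score_pairs(EMB, seed=("from", "fae"))
for score, x, y in ranked[:3]:
    print(f"{x} -> {y}: {score:.3f}")
```

With these toy vectors the top-ranked pair is (football, fitba), ordered the same way as the seed, so football would be assigned to the seed's first variety and fitba to its second. A real system would need a large vocabulary, learned embeddings, and tuned thresholds; none of those details are given on this page.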

Topics

Natural Language Processing > Resources & Methods > Lexical Semantics
Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

code-mixed text; lexicon induction; dialect identification; sociolinguistic variable; lexical variable
