Chinese Lexical Substitution: Dataset and Method

Jipeng Qiang; Kang Liu; Ying Li; Yun Li; Yi Zhu; Yun-Hao Yuan; Xiaocheng Hu; Xiaoye Ouyang

EMNLP 2023

Abstract

Existing lexical substitution (LS) benchmarks were collected by asking human annotators to think of substitutes from memory, resulting in benchmarks with limited coverage and relatively small scale. To overcome this problem, we propose a novel annotation method to construct an LS dataset based on human and machine collaboration. Based on our annotation method, we construct CHNLS, the first Chinese LS dataset, which consists of 33,695 instances and 144,708 substitutes, covering three text genres (News, Novel, and Wikipedia). Specifically, we first combine four unsupervised LS methods into an ensemble method to generate the candidate substitutes, and then let human annotators judge these candidates or add new ones. This collaborative process combines the diversity of machine-generated substitutes with the expertise of human annotators. Experimental results show that the ensemble method outperforms other LS methods. To the best of our knowledge, this is the first study of the Chinese LS task.
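The two-stage process described in the abstract (ensemble candidate generation, then human judgment) can be illustrated with a minimal sketch. The abstract does not say how the four methods are combined, so the reciprocal-rank voting used here is an assumption chosen for simplicity, not the authors' procedure. The function names `ensemble_candidates` and `apply_annotation`, the method labels, and the example words are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def ensemble_candidates(ranked_lists: Dict[str, List[str]], top_k: int = 5) -> List[str]:
    """Merge ranked candidate lists from several LS methods.

    Assumption: reciprocal-rank voting. Each method contributes 1/rank
    to a candidate's score; the paper's actual combination may differ.
    """
    scores = defaultdict(float)
    for candidates in ranked_lists.values():
        for rank, word in enumerate(candidates, start=1):
            scores[word] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    ordered = sorted(scores, key=lambda w: (-scores[w], w))
    return ordered[:top_k]


def apply_annotation(candidates: List[str], accepted: Set[str], added: Iterable[str]) -> List[str]:
    """Keep candidates the annotator accepted, then append new substitutes."""
    kept = [c for c in candidates if c in accepted]
    for word in added:
        if word not in kept:
            kept.append(word)
    return kept


# Hypothetical ranked outputs of four unsupervised LS methods for one target word.
ranked = {
    "method_a": ["quick", "fast", "rapid"],
    "method_b": ["fast", "speedy"],
    "method_c": ["rapid", "fast"],
    "method_d": ["fast", "quick"],
}
machine = ensemble_candidates(ranked, top_k=3)
final = apply_annotation(machine, accepted={"fast", "rapid"}, added=["swift"])
print(machine, final)
```

In this sketch the annotator only filters and extends the machine-generated list, which mirrors the "judge these candidates or add new ones" step at a very coarse level.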

Topics

Machine Learning > Learning Types > Unsupervised Learning; Natural Language Processing > Understanding > Semantic Analysis; Natural Language Processing > Resources & Methods > Lexical Semantics; Interdisciplinary > Linguistics > Computational Linguistics; Machine Learning > Core Methods > Feature Learning; Natural Language Processing > Applications > Text Generation; Deep Learning > Learning Types > Self-Supervised Learning; Natural Language Processing > Understanding > Lexical Semantics

Keywords

text annotation, semantic analysis, ensemble method, human annotation, Chinese NLP, lexical substitution, candidate generation, candidate selection
