XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

Yaobo Liang; Nan Duan; Yeyun Gong; Ning Wu; Fenfei Guo; Weizhen Qi; Ming Gong; Linjun Shou; Daxin Jiang; Guihong Cao; Xiaodong Fan; Ruofei Zhang; Rahul Agrawal; Edward Cui; Sining Wei; Taroon Bharti; Ying Qiao; Jiun-Hung Chen; Winnie Wu; Shuguang Liu; Fan Yang; Daniel Campos; Rangan Majumder; Ming Zhou

EMNLP 2020


Abstract

In this paper, we introduce XGLUE, a new benchmark dataset for training large-scale cross-lingual pre-trained models on multilingual and bilingual corpora and for evaluating their performance across a diverse set of cross-lingual tasks. Compared to GLUE (Wang et al., 2019), which is labeled in English and includes natural language understanding tasks only, XGLUE has three main advantages: (1) it provides two corpora of different sizes for cross-lingual pre-training; (2) it provides 11 diversified tasks that cover both natural language understanding and generation scenarios; (3) for each task, it provides labeled data in multiple languages. We extend a recent cross-lingual pre-trained model, Unicoder (Huang et al., 2019), to cover both understanding and generation tasks, and evaluate it on XGLUE as a strong baseline. We also evaluate the base versions (12-layer) of Multilingual BERT, XLM and XLM-R for comparison.



Topics

Natural Language Processing > Understanding; Natural Language Processing > Generation > Language Modeling; Natural Language Processing > Resources & Methods > Large Language Models; Natural Language Processing > Resources & Methods > Multilingual NLP; Machine Learning > Learning Types > Transfer Learning; Natural Language Processing > Applications > Natural Language Understanding

Keywords

natural language generation; natural language understanding; benchmark dataset; pre-trained model; multilingual model; cross-lingual pre-training; cross-lingual benchmark

