mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval

Xin Zhang; Yanzhao Zhang; Dingkun Long; Wen Xie; Ziqi Dai; Jialong Tang; Huan Lin; Baosong Yang; Pengjun Xie; Fei Huang; Meishan Zhang; Wenjie Li; Min Zhang

EMNLP 2024

Abstract

We present systematic efforts in building a long-context multilingual text representation model (TRM) and reranker from scratch for text retrieval. We first introduce a base-size text encoder enhanced with RoPE and unpadding, pre-trained in a native 8192-token context (longer than the 512 tokens of previous multilingual encoders). We then construct a hybrid TRM and a cross-encoder reranker by contrastive learning. Evaluations show that our text encoder outperforms XLM-R, the previous state of the art at the same size. Meanwhile, our TRM and reranker match the performance of the large-sized state-of-the-art BGE-M3 models and achieve better results on long-context retrieval benchmarks. Further analysis demonstrates that our proposed models exhibit higher efficiency during both training and inference. We believe their efficiency and effectiveness could benefit a range of research and industrial applications.
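
The unpadding mentioned in the abstract can be pictured with a small sketch. This is an illustration only, not the authors' implementation: the helper names `unpad` and `repad`, the padding id `0`, and the cumulative-length bookkeeping are assumptions chosen to show the general idea of packing a padded batch into one flat sequence so that no computation is spent on padding positions.

```python
from typing import List, Tuple

PAD_ID = 0  # assumed padding token id, for illustration only


def unpad(batch: List[List[int]], pad_id: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Pack a padded batch into one flat token list plus cumulative lengths.

    Assumes pad_id never occurs as a real token inside a sequence.
    """
    flat: List[int] = []
    cu_seqlens = [0]  # cu_seqlens[i] is the start offset of sequence i in flat
    for row in batch:
        tokens = [t for t in row if t != pad_id]
        flat.extend(tokens)
        cu_seqlens.append(len(flat))
    return flat, cu_seqlens


def repad(flat: List[int], cu_seqlens: List[int], max_len: int,
          pad_id: int = PAD_ID) -> List[List[int]]:
    """Restore the padded batch layout from the packed representation."""
    rows = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        seq = flat[start:end]
        rows.append(seq + [pad_id] * (max_len - len(seq)))
    return rows


batch = [[5, 6, 7, 0], [8, 0, 0, 0], [9, 10, 0, 0]]
flat, cu_seqlens = unpad(batch)
restored = repad(flat, cu_seqlens, max_len=4)
```

In this toy batch, 6 of the 12 positions are padding, so the packed form holds half as many tokens. An encoder operating on the packed form would also need attention restricted to each sequence's own span, which the offsets in `cu_seqlens` make possible.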

Topics

Natural Language Processing > Applications > Information Retrieval
Natural Language Processing > Resources & Methods > Text Representation
Deep Learning > Learning Types > Contrastive Learning
Deep Learning > Models > Language Models

Keywords

contrastive learning, multilingual retrieval, text representation, text retrieval, multilingual model, long-context model, long-context transformer, text reranking
