Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval

João Coelho; Bruno Martins; Joao Magalhaes; Jamie Callan; Chenyan Xiong

ACL 2024

Abstract

This study investigates the existence of positional biases in Transformer-based language models for text representation learning, particularly in the context of web document retrieval. We build on previous research that demonstrated loss of information in the middle of input sequences for causal language models, extending it to the domain of embedding learning. We examine positional biases at multiple stages of the training pipeline for an encoder-decoder neural retrieval model, namely language model pre-training, contrastive pre-training, and contrastive fine-tuning. Experiments with the MS-MARCO document collection reveal that after contrastive pre-training the model already generates embeddings that better capture the beginning of the input content, with fine-tuning further aggravating this effect.
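One way to picture the kind of positional bias the abstract describes is to compare the embedding of a whole document with the embeddings of its individual chunks, and see how similarity varies with chunk position. The following is only a minimal sketch of that idea and not the authors' experimental protocol. `toy_embed` is a hypothetical stand-in encoder with a decay toward later positions deliberately built in, so the probe has something to detect; `position_profile` and `decay` are likewise illustrative names. With a real retrieval model, `toy_embed` would be replaced by that model's encoder.

```python
import math
from collections import Counter


def toy_embed(tokens, decay=0.7):
    """Hypothetical stand-in encoder (an assumption, not a real model).

    Builds a sparse bag-of-words vector where the token at position p
    gets weight decay**p, so early tokens dominate by construction.
    """
    vec = Counter()
    for pos, tok in enumerate(tokens):
        vec[tok] += decay ** pos
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def position_profile(tokens, n_chunks, embed):
    """Similarity of the full-document embedding to each contiguous chunk.

    A profile that falls off with chunk index suggests the document
    embedding mostly reflects the beginning of the input.
    """
    full = embed(tokens)
    size = math.ceil(len(tokens) / n_chunks)
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return [cosine(full, embed(chunk)) for chunk in chunks]


# Synthetic document of 12 distinct tokens, split into 3 chunks.
doc = [f"w{i}" for i in range(12)]
profile = position_profile(doc, 3, toy_embed)
print(profile)
```

Because the toy encoder favors early positions by design, the printed profile decreases from the first chunk to the last. Any such pattern observed with a real encoder would need to be measured over many documents before drawing conclusions.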

Topics

Machine Learning > Core Methods > Representation Learning
Deep Learning > Architectures > Transformers
Natural Language Processing > Applications > Information Retrieval
Computer Science > Applications > Information Retrieval
Machine Learning > Learning Types > Representation Learning
Deep Learning > Techniques > Contrastive Learning
Deep Learning > Learning Types > Contrastive Learning
Deep Learning > Learning Types > Representation Learning

Keywords

contrastive learning, information retrieval, document retrieval, dense retrieval, text embedding, document embedding, positional bias, transformer model
