DocSplit: Simple Contrastive Pretraining for Large Document Embeddings

Yujie Wang; Mike Izbicki

EMNLP 2023

Abstract

Existing model pretraining methods only consider local information. For example, in the popular token masking strategy, the words closer to the masked token are more important for prediction than words far away. This results in pretrained models that generate high-quality sentence embeddings, but low-quality embeddings for large documents. We propose a new pretraining method called DocSplit, which forces models to consider the entire global context of a large document. Our method uses a contrastive loss where the positive examples are randomly sampled sections of the input document, and the negative examples are randomly sampled sections of unrelated documents. Like previous pretraining methods, DocSplit is fully unsupervised, easy to implement, and can be used to pretrain any model architecture. Our experiments show that DocSplit outperforms other pretraining methods on document classification, few-shot learning, and information retrieval tasks.
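The sampling scheme described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, section length, or encoder, so everything below is an assumption: the loss is a generic InfoNCE-style objective with in-batch negatives, `toy_encode` is a hypothetical stand-in for a real model, and the names `sample_section`, `contrastive_loss`, and `docsplit_batch_loss` are invented for illustration.

```python
import math
import random


def sample_section(tokens, length, rng):
    """Return a random contiguous span of at most `length` tokens."""
    if len(tokens) <= length:
        return list(tokens)
    start = rng.randrange(len(tokens) - length + 1)
    return tokens[start:start + length]


def toy_encode(tokens, dim=16):
    """Hypothetical stand-in encoder: L2-normalized counts of token buckets.

    A real setup would use a trainable model here; this is only a placeholder.
    """
    vec = [0.0] * dim
    for tok in tokens:
        vec[sum(map(ord, tok)) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (an assumption, not confirmed by the abstract).

    anchors[i] and positives[i] are sections of the same document;
    positives[j] for j != i come from other documents and act as negatives.
    """
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature
                  for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def docsplit_batch_loss(docs, section_len=8, seed=0):
    """Sample two random sections per document and score the batch."""
    rng = random.Random(seed)
    anchors = [toy_encode(sample_section(d, section_len, rng)) for d in docs]
    positives = [toy_encode(sample_section(d, section_len, rng)) for d in docs]
    return contrastive_loss(anchors, positives)
```

Calling `docsplit_batch_loss` on a list of tokenized documents returns a single non-negative number. The loss is small when two sections of the same document are embedded closer together than sections of different documents, which is the behavior the abstract attributes to the pretraining objective.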

Topics

Machine Learning > Learning Types > Contrastive Learning
Deep Learning > Techniques > Pretraining
Natural Language Processing > Resources & Methods > Text Representation
Data Science & Analytics > Applications > Information Retrieval
Deep Learning > Techniques > Contrastive Learning
Deep Learning > Learning Types > Contrastive Learning
Deep Learning > Learning Types > Representation Learning
Natural Language Processing > Resources & Methods > Pretraining

Keywords

representation learning, contrastive learning, few-shot learning, self-supervised learning, information retrieval, text representation, document embedding
