Greenback Bears and Fiscal Hawks: Finance is a Jungle and Text Embeddings Must Adapt

Peter Anderson; Mano Vikash Janardhanan; Jason He; Wei Cheng; Charlie Flanagan

2024 EMNLP EMNLP 2024

Greenback Bears and Fiscal Hawks: Finance is a Jungle and Text Embeddings Must Adapt

Abstract

AbstractFinancial documents are filled with specialized terminology, arcane jargon, and curious acronyms that pose challenges for general-purpose text embeddings. Yet, few text embeddings specialized for finance have been reported in the literature, perhaps in part due to a lack of public datasets and benchmarks. We present BAM embeddings, a set of text embeddings finetuned on a carefully constructed dataset of 14.3M query-passage pairs including both public and proprietary financial documents. Demonstrating the benefits of domain-specific training, BAM embeddings achieve Recall@1 of 62.8% on a held-out test set, vs. only 39.2% for the best general-purpose text embedding from OpenAI. Further, BAM embeddings increase question answering accuracy by 8% on FinanceBench and show increased sensitivity to the finance-specific elements that are found in detailed, forward-looking and company and date-specific queries. To support further research we describe our approach in detail, quantify the importance of hard negative mining and dataset scale, and publicly release our embeddings.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Natural Language Processing

🐣 Hot Topic Early Bird — hard negative mining

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Peter Anderson , Mano Vikash Janardhanan , Jason He , Wei Cheng , Charlie Flanagan

Topics

Machine Learning > Application Areas > Domain Adaptation Natural Language Processing > Applications > Information Retrieval Natural Language Processing > Resources & Methods > Text Representation Machine Learning > Learning Types > Domain Adaptation Deep Learning > Learning Types > Representation Learning

Keywords

domain adaptation question answering information retrieval text embedding hard negative mining financial document

Download PDF

Related papers

EmbodiedBERT: Cognitively Informed Metaphor Detection Incorporating Sensorimotor Information 2024

Mitigating Matthew Effect: Multi-Hypergraph Boosted Multi-Interest Self-Supervised Learning for Conversational Recommendation 2024

Learning to Extract Structured Entities Using Language Models 2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis 2024

CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages 2024