Synthesizing question answering data from financial documents: An End-to-End Multi-Agent Approach
Abstract
AbstractAnswering complex questions that require numerical reasoning over financial documents is challenging due to the diverse and scatterednature of relevant information. While large language models (LLMs) excel at financial reasoning, their enterprise deployment is often limited by cost and latency. Small language models (SLMs) present a cost-effective alternative but need to be fine-tuned with high-quality, domain-specific question-answer (QA) data. Acquiring such data requires manual expert annotation, presenting a bottleneck to the wider application of SLMs.This work introduces a modular, scalable end-to-end agentic pipeline that extracts and selects relevant content from unstructured financial documents and then generates QA pairs from the selected content for SLM fine-tuning. Compared to the same models trained on previous manually generated data for the task, one of the models trained on our pipeline-produced synthetic data achieved competitive in-distribution performance, and all tested models demonstrated superior generalization. The framework thus demonstrates considerable potential to accelerate the deployment of smaller, cost-effective models by reducing manual data creation efforts.