FormGym: Doing Paperwork with Agents

Matthew Toles; Isaac Song; Rattandeep Singh; Zhou Yu

EACL 2026

Abstract

End-to-end form filling refers to automatically populating the fields of a document-style form with the appropriate information derived from external data. Although the task is prevalent and useful, no formal benchmark exists for evaluating how accurately systems complete forms. Existing datasets focus on parsing, extraction, and web form interaction rather than end-to-end completion of document-style forms. We propose FormGym, a benchmark formulation of the end-to-end form filling task that evaluates form completion and accuracy. We construct FormGym by repurposing three existing datasets and adding one new dataset to achieve more challenging, diverse, and realistic test cases. Our studies show that baseline vision language agents (VLAs) perform poorly on FormGym in every scenario, primarily due to poor field localization. GUI agents perform better but suffer from high latency and cost. We therefore also introduce FieldFinder, a field localization tool that enables zero-shot VLAs to find input fields and accurately place text in them. VLAs augmented with FieldFinder outperform their baselines across all models.

Topics

Artificial Intelligence > Core AI > Agent Systems
Artificial Intelligence > Core AI > Multimodal Learning

Keywords

zero-shot learning, benchmark evaluation, document understanding, form filling, vision-language agent, field localization
