LaTeX2Layout: High-Fidelity, Scalable Document Layout Annotation Pipeline for Layout Detection

Feijiang Han; Zelong Wang; Bowen Wang; Xinxin Liu; Skyler Cheung; Delip Rao; Chris Callison-Burch; Lyle Ungar

2026 AAAI AAAI 2026

LaTeX2Layout: High-Fidelity, Scalable Document Layout Annotation Pipeline for Layout Detection

Abstract

Abstract General-purpose Vision-Language Models (VLMs) are increasingly integral to modern AI systems for document understanding, yet their ability to perform fine-grained layout analysis remains severely underdeveloped. Overcoming this limitation requires large-scale, high-fidelity training datasets. However, current annotation methods that rely on parsing rendered PDFs are costly, error-prone, and difficult to scale. We propose a different paradigm: extracting ground-truth layout directly from the LaTeX compilation process rather than the final PDF. We present LaTeX2Layout, a generalizable procedural pipeline that recovers pixel-accurate bounding boxes and reading order from compiler traces. This enables the generation of a 140K-page dataset, including 120K programmatically generated synthetic variants that more than double the layout diversity of real-world data. Using this dataset, we fine-tune an efficient 3B-parameter VLM with an easy-to-hard curriculum that accelerates convergence. Our model achieves Kendall's tau=0.95 for reading order and mAP@50=0.91 for element grounding, delivering nearly 200% relative improvement over strong zero-shot baselines such as GPT-4o and Claude-3.7.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Feijiang Han , Zelong Wang , Bowen Wang , Xinxin Liu , Skyler Cheung , Delip Rao , Chris Callison-Burch , Lyle Ungar

Topics

Artificial Intelligence > Core AI > Multimodal Learning Deep Learning > Techniques > Pretraining

Keywords

curriculum learning synthetic data generation vision-language model document layout layout detection

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026