DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models

Sungnyun Kim; Haofu Liao; Srikar Appalaraju; Peng Tang; Zhuowen Tu; Ravi Kumar Satzoda; R. Manmatha; Vijay Mahadevan; Stefano Soatto

EMNLP 2024

Abstract

Visual document understanding (VDU) is a challenging task that involves understanding documents across various modalities (text and image) and layouts (forms, tables, etc.). This study aims to enhance the generalizability of small VDU models by distilling knowledge from LLMs. We identify that directly prompting LLMs often fails to generate informative and useful data. In response, we present a new framework (called DocKD) that enriches the data generation process by integrating external document knowledge. Specifically, we provide an LLM with various document elements, such as key-value pairs, layouts, and descriptions, to elicit open-ended answers. Our experiments show that DocKD produces high-quality document annotations and surpasses the direct knowledge distillation approach that does not leverage external document knowledge. Moreover, student VDU models trained solely on DocKD-generated data are not only comparable to those trained with human-annotated data on in-domain tasks but also significantly outperform them on out-of-domain tasks.
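The abstract's idea of supplying an LLM with document elements can be illustrated with a minimal sketch. This is not the authors' implementation: the `DocumentElements` container, the `build_prompt` function, the section labels, and the prompt wording are all hypothetical, and the actual prompts used by DocKD are not given on this page. The sketch only shows how extracted key-value pairs, layout elements, and a description might be folded into a single generation prompt, in contrast to prompting with the raw document text alone.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DocumentElements:
    """Hypothetical container for external document knowledge (not from the paper)."""
    text: str
    key_value_pairs: Dict[str, str] = field(default_factory=dict)
    layout: List[str] = field(default_factory=list)
    description: str = ""


def build_prompt(doc: DocumentElements, task: str) -> str:
    """Assemble a prompt that includes whichever document elements are present."""
    parts = [f"Document text:\n{doc.text}"]
    if doc.key_value_pairs:
        pairs = "\n".join(f"- {k}: {v}" for k, v in doc.key_value_pairs.items())
        parts.append(f"Key-value pairs:\n{pairs}")
    if doc.layout:
        elements = "\n".join(f"- {e}" for e in doc.layout)
        parts.append(f"Layout elements:\n{elements}")
    if doc.description:
        parts.append(f"Description:\n{doc.description}")
    # The task instruction goes last so the model reads the context first.
    parts.append(f"Task:\n{task}")
    return "\n\n".join(parts)
```

Calling `build_prompt` with only `text` set corresponds to the direct-prompting baseline; filling in the other fields adds the external knowledge that the abstract credits for more informative generated annotations.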

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Knowledge Distillation
Computer Vision > Domain-Specific > Document Analysis
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Knowledge Distillation

Keywords

knowledge distillation, multimodal learning, visual document understanding, document annotation, open-world learning, large language model
