SynthNID: Synthetic Data to Improve End-to-end Bangla Document Key Information Extraction

Syed Mostofa Monsur; Shariar Kabir; Sakib Chowdhury

EMNLP 2023

Abstract

End-to-end Document Key Information Extraction models require substantial compute and labeled data to perform well on real datasets. This is particularly challenging for low-resource languages like Bangla, where domain-specific multimodal document datasets are scarce. In this paper, we introduce SynthNID, a system that generates domain-specific document image data for training OCR-less end-to-end Key Information Extraction systems. We show that the generated data improves the performance of the extraction model on real datasets and that the system can be easily extended to generate other types of scanned documents for a wide range of document understanding tasks. The code for generating synthetic data is available at https://github.com/dv66/synthnid

Topics

Machine Learning > Application Areas > Data Augmentation
Natural Language Processing > Applications > Information Extraction
Computer Vision > Domain-Specific > Document Analysis

Keywords

domain adaptation, document understanding, synthetic data generation, end-to-end learning, end-to-end model, optical character recognition, key information extraction
