DNLP@FinTOC’20: Table of Contents Detection in Financial Documents

Dijana Kosmajac; Stacey Taylor; Mozhgan Saeidi

2020 COLING COLING 2020

DNLP@FinTOC’20: Table of Contents Detection in Financial Documents

Abstract

AbstractTitle Detection and Table of Contents Generation are important components in detecting document structure. In particular, these two elements serve to provide the skeleton of the document, providing users with an understanding of organization, as well as the relevance of information, and where to find information within the document. Here, we show that using tesseract with Levenstein distance, a feature set inspired by Alk et al., we were able to correctly classify the title to an F1 measure 0.73 and 0.87, and the table-of-contents to a harmonic mean of 0.36 and 0.39, in English and French respectively. Our methodology works with both PDF and scanned documents, giving it a wide range of applicability within the document engineering and storage domains.

🧭 Keyword Pioneer — table of contents detection

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Dijana Kosmajac , Stacey Taylor , Mozhgan Saeidi

Topics

Machine Learning > Core Methods > Classification

Keywords

text classification title detection document classification financial document table of contents detection

Download PDF

Related papers

Persuasiveness of News Editorials depending on Ideology and Personality 2020

A Graph Representation of Semi-structured Data for Web Question Answering 2020

Span-based Joint Entity and Relation Extraction with Attention-based Span-specific and Contextual Semantic Representations 2020

Hierarchical Chinese Legal event extraction via Pedal Attention Mechanism 2020

End-to-End Emotion-Cause Pair Extraction with Graph Convolutional Network 2020