The Labeled Segmentation of Printed Books

Lara McConnaughey; Jennifer Dai; David Bamman

EMNLP 2017

Abstract

We introduce the task of book structure labeling: segmenting and assigning a fixed category (such as Table of Contents, Preface, Index) to the document structure of printed books. We manually annotate the page-level structural categories for a large dataset totaling 294,816 pages in 1,055 books evenly sampled from 1750-1922, and present empirical results comparing the performance of several classes of models. The best-performing model, a bidirectional LSTM with rich features, achieves an overall accuracy of 95.8 and a class-balanced macro F-score of 71.4.
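The gap between the two reported numbers is typical of page-level labeling, where a few categories cover most pages. The sketch below is not the authors' evaluation code; it only illustrates, on invented toy labels (`body`, `toc`, `index` are assumptions made for this example), how overall accuracy and a macro-averaged F-score can diverge when classes are imbalanced. The paper's exact averaging procedure may differ.

```python
def accuracy(gold, pred):
    """Fraction of pages whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): 96 body pages, 2 table-of-contents pages, 2 index pages.
gold = ["body"] * 96 + ["toc"] * 2 + ["index"] * 2
# The toy predictions get every body page right, one of two toc pages right,
# and mislabel both index pages as body.
pred = ["body"] * 96 + ["toc", "body"] + ["body", "body"]

acc = accuracy(gold, pred)
mf1 = macro_f1(gold, pred)
print(f"accuracy = {acc:.3f}, macro F1 = {mf1:.3f}")
```

Because macro averaging weights each class equally, errors on the two rare classes pull the macro F-score far below the accuracy, even though only three of the hundred toy pages are mislabeled.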

Topics

Machine Learning > Core Methods > Classification

Deep Learning > Architectures > Neural Networks

Computer Science > Applications > Document Analysis

Artificial Intelligence > Core AI > Natural Language Processing

Keywords

bidirectional LSTM; document segmentation; book structure labeling; structural classification; printed book; page-level classification; book structure

Related papers

Reinforced Video Captioning with Entailment Rewards 2017

Cross-lingual Character-Level Neural Morphological Tagging 2017

Inter-Weighted Alignment Network for Sentence Pair Modeling 2017

Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings 2017

An Empirical Analysis of Edit Importance between Document Versions 2017