2025 NAACL NAACL 2025

A Bit of This, a Bit of That: Building a Genre and Topic Annotated Dataset of Historical Newspaper Articles with Soft Labels and Confidence Scores

Abstract

AbstractDigitised historical newspaper collections are becoming increasingly accessible, yet their scale and diverse content still present challenges for researchers interested in specific article types or topics. In a step towards developing models to address these challenges, we have created a dataset of articles from New Zealand’s Papers Past open data annotated with multiple genre and topic labels and annotator confidence scores. Our annotation framework aligns with the perspectivist approach to machine learning, acknowledging the subjective nature of the task and embracing the hybridity and uncertainty of genres. In this paper, we describe our sampling and annotation methods and the resulting dataset of 7,036 articles from 106 New Zealand newspapers spanning the period 1839-1903. This dataset will be used to develop interpretable classification models that enable fine-grained exploration and discovery of articles in Papers Past newspapers based on common aspects of form, function, and topic. The complete dataset, including un-aggregated annotations and supporting documentation, will eventually be openly released to facilitate further research.

🌉 Interdisciplinary Bridge — Interdisciplinary and Machine Learning and Natural Language Processing
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio