Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages

Noam Dahan; Omer Kidron; Gabriel Stanovsky

EACL 2026

Abstract

High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant, untapped source of naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, in which editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process suited to varying levels of linguistic resources. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.

Topics

Machine Learning > Application Areas > Domain Adaptation; Natural Language Processing > Generation > Summarization; Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

low-resource language; multi-document summarization; data collection; historical newspaper; automatic pipeline
