Evaluating Large Language Models for Narrative Topic Labeling
Abstract
AbstractThis paper evaluates the effectiveness of large language models (LLMs) for labeling topics in narrative texts, comparing performance across fiction and news genres. Building on prior studies in factual documents, we extend the evaluation to narrative contexts where story content is central. Using a ranked voting system with 200 crowdworkers, we assess participants’ preferences of topic labels by comparing multiple LLM outputs with human annotations. Our findings indicate minimal inter-model variation, with LLMs performing on par with human readers in news and outperforming humans in fiction. We conclude with a case study using a set of 25,000 narrative passages from novels illustrating the analytical value of LLM topic labels compared to traditional methods. The results highlight the significant promise of LLMs for topic labeling of narrative texts.