CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization

Brihi Joshi; Sriram Venkatapathy; Mohit Bansal; Nanyun Peng; Haw-Shiuan Chang

ACL 2025

Abstract

Evaluating creative text such as human-written stories using language models has always been a challenging task, owing to the subjectivity of multi-annotator ratings. To mimic the thinking process of humans, chain of thought (CoT) (Wei et al., 2023) generates free-text explanations that help guide a model’s predictions, and Self-Consistency (SC) (Wang et al., 2022) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely used self-consistency reasoning methods cause suboptimal results due to an objective mismatch between generating ‘fluent-looking’ explanations and actually producing a good rating prediction for an aspect of a story. To overcome this challenge, we propose Chain-of-Keywords (CoKe), which generates a sequence of keywords before generating a free-text rationale; these keywords guide the rating prediction of our evaluation language model. We then generate a diverse set of such keyword sequences and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe based on our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4, with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
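The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes that each sampled generation yields a keyword sequence together with a numeric rating, and that the ratings are combined by a simple mean; the abstract only says the scores are aggregated, so the averaging rule is an assumption. The names `aggregate_rating` and `make_toy_sampler` are hypothetical, and the toy sampler stands in for a fine-tuned evaluation model sampled with some stochastic decoding.

```python
from itertools import cycle
from statistics import mean
from typing import Callable, List, Tuple

# One sampled generation: (keyword sequence, predicted rating).
Generation = Tuple[List[str], float]


def aggregate_rating(
    sample_fn: Callable[[str, str], Generation],
    story: str,
    aspect: str,
    n_samples: int = 5,
) -> float:
    """Sample several keyword-then-rating generations and average the ratings.

    Averaging is an assumption; the abstract only says scores are aggregated.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    generations = [sample_fn(story, aspect) for _ in range(n_samples)]
    return mean(score for _, score in generations)


def make_toy_sampler(outputs: List[Generation]) -> Callable[[str, str], Generation]:
    """Hypothetical stand-in for a stochastic evaluation model: cycles fixed outputs."""
    it = cycle(outputs)
    return lambda story, aspect: next(it)


# Three illustrative generations for the same story and aspect (made-up values).
toy = make_toy_sampler([
    (["predictable", "flat ending"], 2.0),
    (["vivid", "surprising twist"], 4.0),
    (["slow start", "strong finale"], 3.0),
])
score = aggregate_rating(toy, "Once upon a time ...", "ending", n_samples=3)
print(score)
```

In practice `sample_fn` would wrap the evaluation model, so that diversity comes from sampling different keyword sequences rather than from a fixed list.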

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Interpretability
Artificial Intelligence > Core AI > Natural Language Generation
Deep Learning > Models > Large Language Models
Machine Learning > Learning Types > Evaluation
Natural Language Processing > Applications > Summarization
Natural Language Processing > Applications > Text Generation

Keywords

chain-of-thought reasoning; language model evaluation; language model; keyword extraction; fine-grained evaluation; fine-grained assessment; story evaluation; keyword generation; keyword rationalization
