From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes

Karen Zhou; John Michael Giorgi; Pranav Mani; Peng Xu; Davis Liang; Chenhao Tan

EMNLP 2025

Abstract

AI-generated clinical notes are increasingly used in healthcare, but evaluating their quality remains a challenge due to high subjectivity and limited scalability of expert review. Existing automated metrics often fail to align with real-world physician preferences. To address this, we propose a pipeline that systematically distills real user feedback into structured checklists for note evaluation. These checklists are designed to be interpretable, grounded in human feedback, and enforceable by LLM-based evaluators. Using deidentified data from over 21,000 clinical encounters (prepared in accordance with the HIPAA safe harbor standard) from a deployed AI medical scribe system, we show that our feedback-derived checklist outperforms a baseline approach in our offline evaluations in coverage, diversity, and predictive power for human ratings. Extensive experiments confirm the checklist’s robustness to quality-degrading perturbations, significant alignment with clinician preferences, and practical value as an evaluation methodology. In offline research settings, our checklist offers a practical tool for flagging notes that may fall short of our defined quality standards.

Topics

Natural Language Processing > Applications > Text Classification

Knowledge & Reasoning > Reasoning > Formal Methods

Healthcare & Medicine > Clinical > Clinical NLP

Artificial Intelligence > Core AI > Large Language Models

Healthcare & Medicine > Clinical > Medical AI

Machine Learning > Learning Types > Evaluation

Natural Language Processing > Applications > Clinical NLP

Keywords

natural language generation, evaluation methodology, clinical note quality assessment, medical ai, large language model, llm-based evaluation, healthcare evaluation, checklist generation, note quality assessment, checklist evaluation
