Evaluating Health Question Answering Under Readability-Controlled Style Perturbations

Md Mushfiqur Rahman; Kevin Lybarger

EMNLP 2025

Abstract

Patients often ask semantically similar medical questions in linguistically diverse ways that vary in readability, tone, and background knowledge. A robust question answering (QA) system should both provide semantically consistent answers across stylistic differences and adapt its response style to match the user's input. However, existing QA evaluations rarely test this capability, creating critical gaps that undermine accessibility and health literacy. We introduce SPQA, an evaluation framework and benchmark that applies controlled stylistic perturbations to consumer health questions while preserving semantic intent, then measures how model answers change across correctness, completeness, coherence, fluency, and linguistic adaptability using a human-validated LLM-based judge. The style axes include reading level, formality, and patient background knowledge; all perturbations are grounded in human annotations to ensure fidelity and alignment with human judgments. Our contributions include a readability-aware evaluation methodology, a style-diverse benchmark with human-grounded perturbations, and an automated evaluation pipeline validated against expert judgments. Evaluation results across multiple health QA models indicate that stylistic perturbations lead to measurable performance degradation even when semantic intent is preserved. The largest performance drops occur in answer correctness and completeness, while models also show limited ability to adapt their style to match the input. These findings underscore the risk of inequitable information delivery and highlight the need for accessibility-aware QA evaluation.
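The degradation measurement described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the 1-to-5 score scale, the function name `mean_score_drop`, the dictionary layout, and the example scores are all assumptions made for illustration. It compares judge scores for answers to original questions against scores for answers to their style-perturbed counterparts, per evaluation dimension.

```python
from statistics import mean

# The five dimensions named in the abstract; the key spellings are assumptions.
DIMENSIONS = (
    "correctness",
    "completeness",
    "coherence",
    "fluency",
    "linguistic_adaptability",
)


def mean_score_drop(original, perturbed):
    """Return the mean (original - perturbed) judge score per dimension.

    `original` and `perturbed` are parallel lists of dicts, one dict per
    question, mapping each dimension to a judge score. A positive value
    means answers scored lower after the stylistic perturbation.
    """
    if len(original) != len(perturbed):
        raise ValueError("score lists must be paired question by question")
    return {
        dim: mean(o[dim] - p[dim] for o, p in zip(original, perturbed))
        for dim in DIMENSIONS
    }


# Made-up scores on an assumed 1-to-5 scale, for illustration only.
original_scores = [
    {dim: 5 for dim in DIMENSIONS},
    {dim: 4 for dim in DIMENSIONS},
]
perturbed_scores = [
    {**original_scores[0], "correctness": 3, "completeness": 4},
    {**original_scores[1], "correctness": 3, "completeness": 3},
]

drops = mean_score_drop(original_scores, perturbed_scores)
```

Pairing scores question by question keeps the comparison tied to the same semantic intent, so any difference reflects the style change rather than a change in what was asked.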

Authors

Md Mushfiqur Rahman, Kevin Lybarger

Topics

Machine Learning > Application Areas > Fairness
Natural Language Processing > Applications > Question Answering
Healthcare & Medicine > Clinical > Medical AI

Keywords

benchmark evaluation, style adaptation, health question answering, style perturbation, linguistic adaptability, readability-controlled perturbation, LLM-based judge, stylistic perturbation
