CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models

Paul Grundmann; Jan Frick; Dennis Fast; Thomas Steffek; Felix Gers; Wolfgang Nejdl; Alexander Löser

2026 EACL EACL 2026

CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models

Abstract

AbstractWith their growing capabilities, generative large language models (LLMs) are being increasingly investigated for complex medical tasks.However, their effectiveness in real-world clinical applications remains underexplored. To address this, we present CliniBench, the first benchmark that enables comparability of well-studied encoder-based classifiers and generative LLMs for discharge diagnosis prediction from admission notes in the MIMIC-IV dataset. Our extensive study compares 12 generative LLMs and 3 encoder-based classifiers and demonstrates that encoder-based classifiers consistently outperform generative models in diagnosis prediction. We assess several retrieval augmentation strategies for in-context learning from similar patients and find that they provide notable performance improvements for generative LLMs.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Paul Grundmann , Jan Frick , Dennis Fast , Thomas Steffek , Felix Gers , Wolfgang Nejdl , Alexander Löser

Topics

Machine Learning > Application Areas > Domain Adaptation Natural Language Processing > Generation > Language Modeling

Keywords

retrieval augmented generation in-context learning clinical outcome prediction language model encoder-based classifier

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026