Papers
33 papers found
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally et al.
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
Lin Shi, Chiyu Ma, Wenhua Liang et al.
JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera, Odellia Boni, Yotam Perlitz et al.
CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?
Yuwei Zhao, Ziyang Luo, Yuchen Tian et al.
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
Noy Sternlicht, Ariel Gera, Roy Bar-Haim et al.
Fooling the LVLM Judges: Visual Biases in LVLM-Based Evaluation
Yerin Hwang, Dongryeol Lee, Kyungmin Min et al.
CourtReasoner: Can LLM Agents Reason Like Judges?
Sophia Simeng Han, Yoshiki Takashima, Shannon Zejiang Shen et al.
Audio-Aware Large Language Models as Judges for Speaking Styles
Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin et al.
Can You Trick the Grader? Adversarial Persuasion of LLM Judges
Yerin Hwang, Dongryeol Lee, Taegwan Kang et al.
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Lianghui Zhu, Xinggang Wang, Xinlong Wang
JudgeBench: A Benchmark for Evaluating LLM-Based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.
Becoming Experienced Judges: Selective Test-Time Learning for Evaluators
Seungyeon Jwa, Daechul Ahn, Reokyoung Kim et al.
Don’t Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation
Jiwon Moon, Yerin Hwang, Dongryeol Lee et al.
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils et al.
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng et al.
EvalAssist: LLM-as-a-Judge Simplified
Michael Desmond, Zahra Ashktorab, Werner Geyer et al.
Humans or LLMs as the Judge? A Study on Judgement Bias
Guiming Hardy Chen, Shunian Chen, Ziche Liu et al.
Can LLM be a Personalized Judge?
Yijiang River Dong, Tiancheng Hu, Nigel Collier
Direct Judgement Preference Optimization
PeiFeng Wang, Austin Xu, Yilun Zhou et al.
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li, Bohan Jiang, Liangjie Huang et al.
Improve LLM-as-a-Judge Ability as a General Ability
Jiachen Yu, Shaoning Sun, Xiaohui Hu et al.
MR. Judge: Multimodal Reasoner as a Judge
Renjie Pi, Haoping Bai, Qibin Chen et al.
Agent-as-Judge for Factual Summarization of Long Narratives
Yeonseok Jeong, Minsoo Kim, Seung-won Hwang et al.
How Reliable is Multilingual LLM-as-a-Judge?
Xiyan Fu, Wei Liu