GPT-4 is Judged More Human than Humans in Displaced and Inverted Turing Tests

Ishika M. Rathi; Sydney Taylor; Benjamin Bergen; Cameron Jones

2025 COLING COLING 2025

GPT-4 is Judged More Human than Humans in Displaced and Inverted Turing Tests

Abstract

AbstractEveryday AI detection requires differentiating between humans and AI in informal, online conversations. At present, human users most often do not interact directly with bots but instead read their conversations with other humans. We measured how well humans and large language models can discriminate using two modified versions of the Turing test: inverted and displaced. GPT-3.5, GPT-4, and displaced human adjudicators judged whether an agent was human or AI on the basis of a Turing test transcript. We found that both AI and displaced human judges were less accurate than interactive interrogators, with below chance accuracy overall. Moreover, all three judged the best-performing GPT-4 witness to be human more often than human witnesses. This suggests that both humans and current LLMs struggle to distinguish between the two when they are not actively interrogating the person, underscoring an urgent need for more accurate tools to detect AI in conversations.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🧭 Keyword Pioneer — human-ai distinction

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Ishika M. Rathi , Sydney Taylor , Benjamin Bergen , Cameron Jones

Topics

Artificial Intelligence > Core AI > AI Safety Artificial Intelligence > Core AI > Human-AI Interaction Machine Learning > Core Methods > Classification Artificial Intelligence > Core AI > Large Language Models Machine Learning > Learning Types > Evaluation

Keywords

conversational ai model evaluation human-ai interaction ai detection turing test human-ai distinction interactive interrogator large language model

Download PDF

Related papers

Navigating Dialectal Bias and Ethical Complexities in Levantine Arabic Hate Speech Detection 2025

TaCIE: Enhancing Instruction Comprehension in Large Language Models through Task-Centred Instruction Evolution 2025

Positive Text Reframing under Multi-strategy Optimization 2025

RAM2C: A Liberal Arts Educational Chatbot based on Retrieval-augmented Multi-role Multi-expert Collaboration 2025

Two-stage Incomplete Utterance Rewriting on Editing Operation 2025