Back in 1950, Alan Turing proposed an elegantly simple yet profoundly challenging way to determine whether machines could be said to "think." Known as the Turing Test, this measure of machine intelligence sets humans and machines in conversational competition, challenging human judges to distinguish between artificial and genuine intelligence through text-based interactions.
Despite numerous attempts, no artificial system had ever convincingly passed this test. Until now.
Cameron Jones and Benjamin Bergen from the University of California, San Diego, have gathered for the first time empirical evidence that OpenAI’s GPT-4.5, a sophisticated large language model (LLM), has successfully passed the Turing Test. Not only did GPT-4.5 pass, but under specific conditions, it outperformed human counterparts in convincing participants of its humanity.
The work has significant implications for society, ethics and humanity’s understanding of intelligence itself.
Human v Machine
The Turing Test pits an interrogator against two conversational partners — one human and one machine — engaging both simultaneously via text. The interrogator's task is simple: to decide which conversational partner is human. But despite rapid advancements in computational linguistics and machine learning, AI systems have consistently failed this seemingly straightforward challenge.
To take the test, Jones and Bergen recruited 284 participants from diverse backgrounds, including undergraduate students and paid crowd-sourced workers from Prolific, a tech company that employs humans to take part in AI-related experiments.
The researchers pitted several contemporary AI models against humans — including GPT-4o, LLaMa-3.1-405B, and GPT-4.5. Interestingly, GPT-4.5 emerged as the winner but only when instructed to adopt a "humanlike persona." In those conversations, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant, say Jones and Bergen, highlighting a remarkable shift in AI’s ability to emulate human conversation.
The researchers also used an older rule-based chatbot called ELIZA to generate text and this was readily identified as a machine by the judges. Similarly, GPT-4o, a previous generation model, also significantly underperformed, chosen as human in just 21% of cases. "The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test," say Jones and Bergen.
Jones and Bergen attribute part of GPT4.5’s success to the careful crafting of prompts designed to guide the model into adopting a persona that humans find relatable and convincingly authentic — specifically, a persona of an introverted young person fluent in internet slang and culture. GPT4.5’s ability to do this, say the researchers, demonstrates nuanced command over language patterns and interactive subtleties previously thought uniquely human.
"It is arguably the ease with which LLMs can be prompted to adapt their behavior to different scenarios that makes them so flexible: and apparently so capable of passing as human," say Jones and Bergen. This adaptability, rather than being a weakness, is precisely what underscores their emerging intelligence.
Of course, the work also raises the thorny question of whether the Turing Test is measuring intelligence at all or just measuring the ability to pass the test. Either way, the success of GPT-4.5 challenges the conventional wisdom that genuine intelligence must include conscious awareness or deep comprehension. It may even prompt a reevaluation of the criteria used to define cognitive abilities and intellect.
Evolving Intelligence
That’s an impressive result with significant ethical, economic and social implications. "Models with this ability to robustly deceive and masquerade as people could be used for social engineering or to spread misinformation," say the researchers, warning of the potential misuse of “counterfeit humans” in politics, marketing and cybersecurity.
But there is also a clear upside, albeit with important caveats. Better conversational agents could significantly enhance human-computer interactions, improve automated services, virtual assistance, companionship and educational tools. Achieving a balance between utility and risk will probably require carefully considered regulation.
The work may also force humans to change how they interact with each other. Jones and Bergen imagine a greater cultural emphasis on authentic human interaction, spurred by the ubiquity of capable AI counterparts.
This blurring of the distinction between machines and humans would surely have fascinated even Turing himself.
Ref: Large Language Models Pass the Turing Test : arxiv.org/abs/2503.23674