Summary

UC San Diego released a paper that potentially provides the first robust evidence that an AI system has passed the legendary Turing test. So what does this mean? How was the test conducted? Let’s unpack this milestone and its implications for our digital lives.

What Is the Turing Test?

The legendary mathematicianAlan Turingproposed a method to evaluate whether a machine’s intelligence is on par with that of a human. This is known as the famousTuring test. In its simplest form, the test involves a human judge engaging in a text-based conversation with both a human and a machine, without knowing which is which. If the judge cannot reliably tell the machine from the human, the machine passes the test.

Initially, people considered that if a machine convinces human interrogators 30% of the time that they’re talking to another human, it has passed the test. However, many argue that a 50% pass rate—where judges perform no better than chance at identifying the AI—is a more meaningful threshold.

Messaging App Interface from the UCSD Turning Test

Over the years, there have been numerous claims that AI chatbots have passed the Turing test, but those tests lack sufficient rigor and controls to actually convince AI scientists. The UC San Diego study aimed to address these shortcomings and provide a more definitive evaluation.

The Turing Test Conducted by UC San Diego

Cameron R. Jones and Benjamin K. Bergen from the Department of Cognitive Science, UC San Diego, published a paper titledPeople cannot distinguish GPT-4 from a human in a Turing test.They conducted a version of the Turing test that stands out for its rigorous methodology and scale. As per the paper:

The results provide the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test.

Graph of strategies used by interrogators in turing test vs their accuracy

Here’s an overview of how the experiment was set up:

Before the test, each participant completed a demographic survey to probe factors that might affect their aptitude. These include their age, gender, education level, familiarity with AI technologies, and so on. The test wasrandomized, controlled, andpre-registeredto ensure scientific validity.

The game interface was designed to look like a conventional messaging app, with the interrogator sending the first message, and each participant limited to one message at a time. After the 5-minute limit, the interrogator had to give a verdict on whether they thought the witness was human or AI, along with their confidence level and reasoning.

Graph showing reasons why subject was classified as AI vs Human in Turning test

The AI Prompt and Instructions

Both GPT models were prompted to be deliberately casual, make spelling mistakes, and use internet slang to mimic more natural human conversation. They had access to current news headlines and the top three songs on the billboard to avoid detection for not knowing current events. The AI responses were also delayed based on message length to prevent superhuman typing speeds as another giveaway factor.

If you want tohumanize your ChatGPT responses, it’s worth trying out the above prompt yourself.

The researchers also provided the AI models with an image of the messaging app’s user interface so they’re prepared to answer any questions relating to the app they’re using for communication.

GPT-4 Successfully Passed Off as Human in 54% of Cases

The researchers defined the “pass rate” as the proportion of games in which the interrogators judged the subject (referred to as “witness”) to be human. Here’s the pass rate of each model and human in the test:

As you may see, both GPT-4 and GPT-3.5 outperform the ELIZA baseline, meeting both 30% and 50% thresholds.

That said, since the results were close to 50% for both test models, could this be attributed to chance or random guessing? Well no! This is where the ELIZA baseline comes in as a manipulation check. The ELIZA control scored significantly lower than both GPT models—as per expectations. If the interrogators were just guessing, then ELIZA should’ve performed on par with the GPT models.

Also, the researchers collected participants' confidence scores and decision justifications. Interrogators who concluded GPT-4 was human had a mean confidence score of 73%, which means they were reasonably sure of their assessments.

It’s worth noting that older participants were less accurate at assessing whether a subject was AI or human. This is somewhat expected! The more confusing bit—which is also contrary to the researchers’ expectations—is that participants who self-reported as being more knowledgeable aboutAI and Large Language Models (LLMs)were no more successful in identifying them.

Humans Were Correctly Identified Only 67% of the Times

More surprising than GPT-4 passing the Turing test is that only 67% of humans were correctly identified as humans. So, are people mistaking AI to be humans while judging humans to be bots? This, to me, is a very disturbing scenario that raises a lot of concerns!

According to the study, the relatively low pass rate for humans likely reflects changing assumptions about AI capabilities. As people become more aware of advanced AI systems, they may become more likely to mistake humans for AI—as they expect AI to behave more like humans.

But what does it mean to be “Human”? What criteria or metrics were the interrogators using to judge whether the subject is human or AI?

How Did Interrogators Flag a Chat as AI?

The researchers collected all the strategies used by the interrogators to determine whether a subject is AI or not, along with their reasons for coming to their verdict. They then categorized and scored which strategies were most effective at identifying AI, along with the reasons they shared.

Here’s a chart from the study showing the proportions of strategies used (left) and the accuracy of the strategy at identifying AI (right).

As you may see, most interrogators asked for personal details, followed by opinions and daily activities, to identify AI. However, all these strategies had a success rate of 50-60%. The most successful strategies were asking questions about human experiences, logic & math, and simply accusing the subject of being AI, which had over 75% success rates. Unfortunately, these strategies are used less than 5% of the time.

Also, when asked why an interrogator flagged a subject as AI, they noted having a forced persona, being too informal, and lacking personality as the three main reasons. However, they flagged a subject as human if they engaged in an informal human-like tone.

So, what does it mean to have a human-like tone? And where does an “informal” tone become “too informal” that you start to think that a human might be an AI? Unfortunately, answers to these questions weren’t in the scope of the study.

Will AI Become Indistinguishable From Human Intelligence?

This test gave interrogators only five minutes to tell humans apart from AI. If given more than 5-mins, people might have better accuracy at discerning AI from humans. However, I personally think the 5-min window is very important. If you engage in a conversation with someone online and need to decide if they are human or AI, it’s likely that you won’t have an hour to make that call.

Moreover, the study used GPT-4 and GPT-3.5. Since then, we’ve had access toGPT-4oandClaude 3.5 Sonnet, both of which are better than GPT-4 in almost all departments. Needless to say, future AI systems are going to be even smarter and more convincingly human.

As such, I think we need to develop a skill set to quickly and efficiently tell AI apart from humans. The study clearly shows that the most common strategies barely have a success rate better than chance. Even knowing how AI systems work didn’t give interrogators any noticeable edge. So, we need to learn new strategies and techniques to identify AI, or werisk falling victim to hackers and bad actorsusing AI.

Right now, the best cure seems to bemore exposure. As you engage with more AI content, you’ll start to pick up on cues and subtleties that’ll help identify them more quickly.

For example, I use Claude a lot and can easily tell if articles or YouTube video scripts are generated using it. Claude tends to use passive voice more than active voice. If you ask them to write more concisely, they generate unnatural (albeit grammatically correct) 2-3 word sentences or questions.

That said, spotting AI content is still a very intuitive process for me and not something I can algorithmically break down and explain. However, I believe more exposure to AI content will arm people with the necessary mindset to detect them.