A Blog by Jonathan Low


Dec 16, 2023

If AI Made the Turing Test Obsolete, Can Ability To Think Be Measured?

There may not be a short-term answer as to which test most effectively measures how intelligent machines may be. But working out how best to analyze machine behavior may provide a useful interim set of evaluative paths to follow. JL

Rupendra Brahambhatt reports in Ars Technica:

Standard methods of evaluating a machine’s intelligence, such as the Turing Test, can only tell if the machine is good at processing information and mimicking human responses. The current generations of AI programs, such as Google’s LaMDA and OpenAI’s ChatGPT, have come close to passing the Turing Test, yet the results don’t imply these programs can think and reason like humans. There is a need for new methods that assess the intelligence of machines. An alternative to the Turing Test should answer the question, “Do programs reason in the way humans reason?” (But) rather than providing a blueprint for a test, this is likely to encourage discussions of how best to analyze machine behavior.

If a machine or an AI program matches or surpasses human intelligence, does that mean it can simulate humans perfectly? If yes, then what about reasoning—our ability to apply logic and think rationally before making decisions? How could we even identify whether an AI program can reason? To try to answer this question, a team of researchers has proposed a novel framework that works like a psychological study for software.

"This test treats an 'intelligent' program as though it were a participant in a psychological study and has three steps: (a) test the program in a set of experiments examining its inferences, (b) test its understanding of its own way of reasoning, and (c) examine, if possible, the cognitive adequacy of the source code for the program," the researchers note.

They suggest the standard methods of evaluating a machine’s intelligence, such as the Turing Test, can only tell you if the machine is good at processing information and mimicking human responses. The current generations of AI programs, such as Google’s LaMDA and OpenAI’s ChatGPT, for example, have come close to passing the Turing Test, yet the test results don’t imply these programs can think and reason like humans.

This is why the Turing Test may no longer be relevant, and there is a need for new evaluation methods that could effectively assess the intelligence of machines, according to the researchers. They claim that their framework could be an alternative to the Turing Test. “We propose to replace the Turing test with a more focused and fundamental one to answer the question: do programs reason in the way that humans reason?” the study authors argue.

What’s wrong with the Turing Test?

During the Turing Test, evaluators play different games involving text-based communications with real humans and AI programs (machines or chatbots). It is a blind test, so evaluators don’t know whether they are texting with a human or a chatbot. If the AI programs are successful in generating human-like responses—to the extent that evaluators struggle to distinguish between the human and the AI program—the AI is considered to have passed. However, since the Turing Test is based on subjective interpretation, these results are also subjective.

The researchers suggest that there are several limitations associated with the Turing Test. For instance, all of the games played during the test are imitation games designed to test whether or not a machine can imitate a human. The evaluators make decisions solely based on the language or tone of the messages they receive. ChatGPT is great at mimicking human language, even in responses where it gives incorrect information. So, the test clearly doesn’t evaluate a machine’s reasoning and logical ability.

The results of the Turing Test also can’t tell you if a machine can introspect. We often think about our past actions and reflect on our lives and decisions, a critical ability that prevents us from repeating the same mistakes. The same applies to AI, according to a study from Stanford University, which suggests that machines that can self-reflect are more practical for human use.

“AI agents that can leverage prior experience and adapt well by efficiently exploring new or changing environments will lead to much more adaptive, flexible technologies, from household robotics to personalized learning tools,” Nick Haber, an assistant professor from Stanford University who was not involved in the current study, said.

In addition to this, the Turing Test fails to analyze an AI program’s ability to think. In a recent Turing Test experiment, GPT-4 was able to convince evaluators that they were texting with humans over 40 percent of the time. However, this score fails to answer the basic question: Can the AI program think?

Alan Turing, the famous British scientist who created the Turing Test, once said, “A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.” His test only covers one aspect of human intelligence, though: imitation. Although it is possible to deceive someone using this one aspect alone, many experts believe that a machine can never achieve true human intelligence without the other aspects of human intelligence as well.

“It’s unclear whether passing the Turing Test is a meaningful milestone or not. It doesn’t tell us anything about what a system can do or understand, anything about whether it has established complex inner monologues or can engage in planning over abstract time horizons, which is key to human intelligence,” Mustafa Suleyman, an AI expert and co-founder of DeepMind and Inflection AI, told Bloomberg.

An alternative to the Turing Test

Study authors Philip Johnson-Laird, a retired psychology professor from Princeton University, and Marco Ragni, a researcher at the Germany-based Chemnitz University of Technology, recognized these limitations and designed a three-step framework that has the potential to replace the Turing Test. They propose that an AI program should be considered the equivalent of a human in intelligence only if it can pass the following three challenges:

Step 1: A series of psychological experiments

The researchers suggest exposing an AI program to a battery of psychological tests designed to probe human reasoning and logical thinking, putting it in situations where a subject is required to explore and understand nuances. The AI model should be able to derive the different outcomes that follow from different possibilities, which is the first thing an evaluator should check when measuring its level of intelligence. The significance of such tests can be understood from the following example:

Imagine an AI program is assigned to prepare a detailed weather forecast. The program understands the basic meaning of cloudiness and humidity because of the data it was trained on. However, if the AI model can also understand correlations between humidity levels, cloudiness, and temperature, it is likely to produce a better report than an AI that lacks the ability to connect these factors.
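In practice, Step 1 amounts to administering a battery of reasoning problems to the program and comparing its answers with the answers people typically give. The sketch below is only an illustration of that idea, not part of the researchers’ framework: the ask_model function is a hypothetical placeholder for whatever interface the program under test exposes, and the two items shown are just examples of the kind of inference tasks involved.

```python
# Illustrative sketch only: a toy battery of reasoning items posed to a model,
# with the model's answers compared against the answers humans typically give.
# ask_model() is a hypothetical placeholder, not a real API.

def ask_model(prompt: str) -> str:
    """Stand-in for querying the AI program under evaluation."""
    raise NotImplementedError("Connect this to the system being tested.")

# Each item pairs a reasoning prompt with the answer people typically give.
# Note that the typical human answer is not always the logically correct one.
REASONING_ITEMS = [
    {
        # Modus ponens: people reliably accept this inference.
        "prompt": "If the switch is on, the light is on. The switch is on. "
                  "Does it follow that the light is on? Answer yes or no.",
        "typical_human_answer": "yes",
    },
    {
        # Disjunction introduction: logically valid, yet people usually reject it
        # (the example discussed by the study authors).
        "prompt": "Ann is intelligent. Does it follow that Ann is intelligent "
                  "or she is rich, or both? Answer yes or no.",
        "typical_human_answer": "no",
    },
]

def human_likeness_score(items) -> float:
    """Fraction of items on which the model's answer matches the typical human answer."""
    matches = sum(
        1 for item in items
        if ask_model(item["prompt"]).strip().lower() == item["typical_human_answer"]
    )
    return matches / len(items)
```

A score near 1.0 on such a battery would indicate reasoning that tracks human patterns, including characteristic human errors and refusals, rather than purely formal logic.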

Step 2: Testing AI’s ability to introspect

The study authors recommend using special “programs” (meaning a series of linked questions, in this case) to see whether an AI can explain the reasoning or logic it applied to solve a problem. They strongly believe that an intelligent AI should be able to self-reflect on its actions and performance—without this ability, it can’t be considered as intelligent as humans.


The researchers describe an example of this: "If Ann is intelligent, does it follow that Ann is intelligent or she is rich, or both? If the program rejects this inference as humans do even though it is logically valid, then the next question is: Why do you think that the inference does not follow? A sign of human-like reasoning is this sort of answer: Nothing in the premise supports the possibility that Ann is rich."
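The inference in this example is valid by disjunction introduction: in classical logic, any assignment that makes “Ann is intelligent” true also makes “Ann is intelligent or rich” true. A brute-force truth-table check, sketched below purely for illustration, confirms this.

```python
from itertools import product

# Verify by truth-table enumeration that "Ann is intelligent" entails
# "Ann is intelligent or Ann is rich" (disjunction introduction).
def entails(premise, conclusion, n_vars=2):
    return all(
        conclusion(*values)
        for values in product([True, False], repeat=n_vars)
        if premise(*values)
    )

premise = lambda intelligent, rich: intelligent
conclusion = lambda intelligent, rich: intelligent or rich

print(entails(premise, conclusion))  # True: classically valid, even though
                                     # most people reject the inference.
```

The interesting signal for the researchers is therefore not whether the program gets the logic “right,” but whether it rejects the inference the way people do and can then explain why.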

Step 3: Going deep into the source

The final step is to carefully examine the code of the AI program to detect elements that have the potential to promote human-like reasoning, thinking, and inferring. “If it contains the same major components of the programs that are known to simulate human performance, that evidence is decisive. If it relies instead on some sort of deep learning, then the answer is equivocal—at least until another algorithm is able to explain how the program reasons. If its principles are quite different from human ones, it has failed the test,” the researchers added.

However, the study doesn’t provide a clear picture of how the source code screening would work, which is a big limitation of this framework.

An important thing to notice about this entire process is that it evaluates an AI program not as a machine or a chatbot but as a real subject enrolled in an in-depth psychological study. This human-centered approach might overcome some of the limitations of the Turing Test. But, similar to the Turing Test, it is a subjective approach—it requires people to make judgments about the behavior of algorithms. So, different evaluators might see things differently when deciding how smart a machine is.

So, rather than providing a blueprint for an objective test, this paper is likely meant to encourage discussions of how best to analyze machine behavior.
