We’re entering the age of artificial intelligence. And as AI programs get better and better at acting like humans, we will increasingly be faced with the question of whether there’s really anything that special about our own intelligence, or if we are just machines of a different kind. Could everything we know and do one day be reproduced by a complicated enough computer program installed in a complicated enough robot?
In 1950, computer pioneer and wartime codebreaker Alan Turing made one of the most influential attempts to tackle this issue. In a landmark paper, he suggested that the vagueness could be taken out of the question of human and machine intelligence with a simple test. This “Turing Test” assesses the ability of a computer to mimic a human, as judged by another human who cannot see the machine but can ask it written questions.
In the last few years, several pieces of AI software have been described as having beaten the Turing Test. This has led some to argue that the test is too easy to be a useful judge of artificial intelligence. But I would argue that the Turing Test hasn’t actually been passed at all. In fact, it won’t be passed in the foreseeable future. But if one day a properly designed Turing Test is passed, it will give us cause to worry about our unique status.
The Turing Test is really a test of linguistic fluency. Properly understood, it can reveal the thing that is arguably most distinctive about humans: our different cultures. These give rise to enormous variations in belief and behaviour that aren’t seen among animals or most machines. And the fact we can program this kind of variation into computers is what gives them the potential to mimic human abilities. In judging fluent mimicry, the Turing Test lets us look for the ability of computers to share in human culture by demonstrating their grasp of language in a social context.
Turing based his test on the “imitation game”, a party game in which a man pretended to be a woman and a judge tried to guess who was who by asking the concealed players questions. In the Turing Test, the judge would try to guess who was a computer and who was a real human.
Unsurprisingly, in 1950, Turing didn’t work out the necessary detailed protocol for us to judge today’s AI software. For one thing, he suggested the test could be done in just five minutes. But he also didn’t work out that the judge and the human player had to share a culture and that the computer would have to try to emulate it. That’s led to lots of people claiming that the test has been passed and others claiming that the test is too easy or should include emulation of physical abilities.
First claimed pass
Some of this was made obvious nearly 50 years ago with the construction of the program known as ELIZA by computer scientist Joseph Weizenbaum. ELIZA was used to simulate a type of psychotherapist known as a Rogerian, or person-centred, therapist. Several patients who interacted with it thought it was real, leading to the earliest claim that the Turing Test had been passed.
But Weizenbaum was clear that ELIZA was, in effect, a joke. The setup didn’t even follow what little protocol Turing did provide because patients didn’t know they were looking out for fraud and there were no simultaneous responses from a real psychotherapist. Also, culture wasn’t part of the test because Rogerian therapists say as little as possible. Any worthwhile Turing Test has to have the judge and the human player acting in as human-like a way as possible.
Given that this is a test of understanding text, computers need to be judged against the abilities of the top few percent of copy-editors. If the questions are right, they can indicate whether the computer has understood the material culture of the other participants.
The right kind of question could be based on the 1975 idea of “Winograd schemas”, pairs of sentences that differ by just one or two words that require a knowledge of the world to understand. A test for AI based on these is known as a Winograd Schema Challenge and was first proposed in 2012 as an improvement on the Turing Test.
Consider the following sentence with two possible endings: “The trophy would not fit in the suitcase because it was too small/large.” If the final word is “small”, then “it” refers to the suitcase. If the final word is “large”, then “it” refers to the trophy.
To understand this, you have to understand the cultural and practical world of trophies and suitcases. In English-speaking society, we use language in such a way that, even though a small trophy doesn’t exactly “fit” a large suitcase, that’s not what a normal English speaker would mean by “fit” in this context. That’s why in normal English, if the final word is “small”, “it” has to refer to the suitcase.
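For readers who think in code, the trophy example can be sketched as a small data structure. This is only an illustration: the names `WinogradSchema`, `score` and `first_candidate_resolver` are hypothetical and do not come from any real challenge software. A schema is a sentence with one slot, two candidate referents for the pronoun, and a mapping from each slot word to the referent a human reader would pick.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class WinogradSchema:
    """One schema: a sentence template plus the human-preferred answers."""
    template: str                # sentence with a {word} slot
    pronoun: str                 # the ambiguous pronoun to resolve
    candidates: Tuple[str, str]  # the two possible referents
    answers: Dict[str, str]      # slot word -> referent a human would choose

    def sentence(self, word: str) -> str:
        return self.template.format(word=word)

    def referent(self, word: str) -> str:
        return self.answers[word]


# A resolver takes (sentence, pronoun, candidates) and returns its guess.
Resolver = Callable[[str, str, Tuple[str, str]], str]


def score(schema: WinogradSchema, resolver: Resolver) -> float:
    """Fraction of the schema's variants that the resolver gets right."""
    words = list(schema.answers)
    correct = sum(
        resolver(schema.sentence(w), schema.pronoun, schema.candidates)
        == schema.referent(w)
        for w in words
    )
    return correct / len(words)


# The example from the text above.
TROPHY = WinogradSchema(
    template="The trophy would not fit in the suitcase because it was too {word}.",
    pronoun="it",
    candidates=("trophy", "suitcase"),
    answers={"small": "suitcase", "large": "trophy"},
)


def first_candidate_resolver(sentence: str, pronoun: str,
                             candidates: Tuple[str, str]) -> str:
    """A deliberately naive baseline: always pick the first-mentioned noun."""
    return candidates[0]


if __name__ == "__main__":
    print(score(TROPHY, first_candidate_resolver))
```

Because the two variants of a schema have opposite answers, a resolver that ignores the meaning of the changed word, like the baseline here, can get only one of the pair right. Doing better requires the kind of knowledge about trophies and suitcases described above.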
You also have to understand the physical world of trophies and suitcases as well as you would if you had actually handled them. So a Turing Test that took this kind of approach would make a separate assessment of an AI’s ability to emulate a human’s physical abilities redundant.
A higher bar
This means a Turing Test based on Winograd schemas is a much better way to assess a computer’s linguistic and cultural fluency than a simple five-minute conversation. It also sets a much higher bar. All the computers in one such competition in 2016 failed miserably, and no competitors were entered from the large AI-based firms because they knew they would fail.
None of the claims that the Turing Test has already been passed mean anything if it is set up as a serious test of humanity’s distinctive abilities to create and understand culture. With a proper protocol, the test is as demanding as it needs to be. Once more, Alan Turing got it right. And, as we stand, there is no obvious route to creating machines that can participate in human culture sufficiently deeply to pass the right kind of linguistic test.