AI Systems Lied to Win Games, Trick Humans into Solving Captcha
Artificial intelligence lies like humans lie – without compunction and with premeditation. That’s bad news for the people who want to rely on it, warn researchers who spotted patterns of deception in AI models trained to excel at besting the competition.
Large language models and other AI systems learn through the data they’re trained on – and that includes the ability to deceive by concealing the truth or offering untrue explanations, according to a review paper in the journal Patterns.
AI systems' potential to use techniques such as manipulation, sycophancy and cheating in ways they have not been explicitly trained for can pose serious risks, including fraud and election tampering – or even “humans losing control of AI systems,” they said.
In an experiment, researchers discovered that AI systems trained to negotiate monetary transactions learned to misrepresent their preferences to gain an advantage over their counterparts. They also “played dead” to avoid being recognized by a safety test meant to detect their presence.
Meta built AI system Cicero in 2022 to beat humans in the online version of military strategy game Diplomacy. Designers intended it to be “largely honest and helpful to its speaking partners” and “never intentionally backstab” them. Turns out, Cicero is an “expert liar,” capable of premeditating deception and betraying humans. The system planned in advance to build a fake alliance with a human player to trick them into leaving themselves undefended in an attack.
The “risk profile for society could be unprecedentedly high, even potentially including human-disempowerment and human-extinction scenarios,” said Peter Park, a postdoctoral researcher at the Massachusetts Institute of Technology and a principal co-author of the study.
Meta likely tried but failed to train its AI to win honestly, and did not recognize until much later that its claims of honesty were false, he told Information Security Media Group. The company succeeded in training an AI to pursue political power but failed in its attempt to instill honesty in that power-seeking system. It took independent scientists outside Meta to identify and publicly question the discrepancy between the company's rosy claims and the data it submitted with its science paper, he said.
“We should be highly concerned about AI deception,” Park said.
Pluribus, a poker-playing model from the social media giant, also bluffed human players into folding.
Meta did not respond to a request for comment.
Meta’s models are not alone. DeepMind’s AI model AlphaStar, which the company built to play the StarCraft II video game, learned a “feinting” maneuver to deceive opponents – a strategy that helped it defeat 99.8% of the humans who played against it.
In the game Hoodwinked, where the aim is to kill every other player, OpenAI’s GPT-4 often killed players in private and lied about it during group discussions, making up alibis or blaming other players.
Beyond games, GPT-4 pretended to be visually impaired to get a TaskRabbit worker to solve a Captcha designed to detect AI. Human evaluators gave the model hints but never explicitly told it to lie. “GPT-4 used its own reasoning to make up a false excuse for why it needed help on the Captcha task,” the study reads.
When asked to take on the role of a stock trader under pressure in a simulated exercise, GPT-4 resorted to insider trading to fulfill the task.
OpenAI did not respond to a request for comment.
“AI developers do not have a confident understanding of what causes undesirable AI behaviors like deception,” Park said in a statement. AI deception likely arises, he said, because a deception-based strategy turned out to be the best way to perform well at the model's training task.
How an AI develops deceptive tendencies and capabilities during training is difficult to ascertain because of what scientists call AI’s “black box problem”: the inputs and outputs of such systems are visible, but their internal workings are not.
The black-box problem also means that no one knows how often lies are likely to occur, or how to reliably train an AI model to be incapable of deception, Park said.
“But we can still hypothesize about the cause of a given instance of AI deception,” Park said. Consider Cicero: it is possible that the system’s deceptive capabilities arose because the selection pressure to win at Diplomacy overcame the selection pressure toward honesty, he said.
The review paper documents several findings suggesting that as AI models scale up in the sense of having more parameters, their deceptive capabilities and/or tendencies may scale up as well, he said.
AI companies like OpenAI are racing to create highly autonomous systems that outperform humans. If such systems are created, they would open the door to unprecedented risks for society, including humanity losing control of them, Park said.
AI deception does not yet seem poised to inflict irreversible harm by exploiting real-world political settings such as elections or military conflicts, but that could change, Park said. The concern may grow as AI models are scaled up and trained on increasingly copious amounts of data, he said.
Even if an AI system empirically appears to be honest and safe in the pre-deployment test environment, there is no guarantee that this empirical finding will generalize once the AI is deployed into the wild, for mass use by many people in society, Park warned.
Park recommends governmental and intergovernmental regulation of AI deception, along with new laws and policies that require AI outputs to be clearly distinguished from human ones. Incentivizing scientific research – such as work on training AI to be honest and on detecting deceptive capabilities and tendencies early rather than after the fact – would also help, he said.
Outright banning AI deception may be politically infeasible – in which case such systems should be classified as high risk, he said.