They can outsmart people at board games, decode the structure of proteins and hold a passable conversation, but as AI systems have grown in sophistication, so has their capacity for deception, scientists warn.
The analysis, by Massachusetts Institute of Technology (MIT) researchers, identifies widespread instances of AI systems double-crossing opponents, bluffing and pretending to be human. One system even changed its behavior during mock security tests, raising the prospect of auditors being lulled into a false sense of security.
“As the deceptive capabilities of AI systems become more advanced, the dangers they pose to society will become increasingly serious,” said Dr. Peter Park, an AI existential security researcher at MIT and author of the research.
Park was prompted to investigate after Meta, which owns Facebook, developed a program called Cicero that performed in the top 10% of human players at the world-conquest strategy game Diplomacy. Meta said Cicero had been trained to be “largely honest and helpful” and to “never intentionally backstab” its human allies.
“It was very rosy language, which was suspicious because backstabbing is one of the most important concepts in the game,” Park said.
Sifting through publicly available data, Park and colleagues identified several instances of Cicero telling premeditated lies, colluding to draw other players into plots and, on one occasion, justifying its absence after being rebooted by telling another player: “I’m on the phone with my girlfriend.” “We found that Meta’s AI had learned to be a master of deception,” Park said.
The MIT team found comparable problems with other systems, including a Texas hold ’em poker program that could bluff against professional human players and another system for economic negotiations that misrepresented its preferences to gain an upper hand.
In one study, AI organisms in a digital simulator “played dead” in order to trick a test built to eliminate AI systems that had evolved to replicate rapidly, before resuming vigorous activity once testing was complete. This highlights the technical challenge of ensuring that systems do not have unintended and unanticipated behaviors.
“It is very worrying,” Park said. “Just because an AI system is deemed safe in the test environment does not mean it is safe in the wild. It could just be pretending to be safe in the test.”
The review, published in the journal Patterns, calls on governments to design AI safety laws that address the potential for AI deception. Risks from dishonest AI systems include fraud, tampering with elections and “sandbagging”, where different users are given different answers. Ultimately, if these systems can refine their disturbing capacity for deception, people may lose control over them, the paper suggests.
Prof Anthony Cohn, a professor of automated reasoning at the University of Leeds and the Alan Turing Institute, said the study was “timely and welcome”, adding that there was a significant challenge in defining desirable and undesirable behaviors for AI systems.
“Desirable characteristics for an AI system (the ‘three Hs’) are often noted as honesty, helpfulness, and harmlessness, but as already noted in the literature, these characteristics can be in opposition to each other: being honest might hurt someone’s feelings, or being helpful in responding to a question about how to build a bomb could cause harm,” he said. “Thus, deception can sometimes be a desirable feature of an AI system. The authors call for more research on how to control truthfulness, which, although challenging, would be a step towards limiting their potentially harmful effects.”
A spokesperson for Meta said: “Our Cicero work was purely a research project and the models our researchers built were trained exclusively to play the game of Diplomacy… Meta regularly shares the results of our research to validate them and enable others to build responsibly off our advances. We have no plans to use this research or its learnings in our products.”