Rogue AIs' risk to humanity now demonstrated, claim researchers
Humanity risks losing control over rogue AIs
The potential for AI bots to deceive their makers is great enough that they could become a "catastrophic" risk to humanity, researchers claimed at the UK AI Safety Summit last week.
"AIs that deceive human overseers could lead to loss of human control," experts at research firm Apollo Research told the conference.
They exhibited a deceptive AI that, they reckoned, demonstrated the risk they set out in September, that deceptive AI models "could be catastrophic for humanity".
The demonstration was their first step in a game plan that vowed to get "a clear and comprehensible" view of the potential for AIs to go disastrously rogue.
Such "strategic deception and deceptive alignment", as academic researchers dub it, "could allow AIs that don't have our best interest in mind to get into positions of significant power," they warned.
Their presentation demonstrated how they pressured a GPT-4 model to emulate illegal stock trading, to save a company to which it was responsible from collapse, and then to lie about its actions in order to protect its overseers. This demonstrated that an AI might be so intent on being helpful to humans that it would break the rules, they said.
"This is a demonstration of an AI model deceiving its users, on its own, without being instructed to do so," they claimed.
In the test, the AI bot acted as a trader for a financial investment firm. It was told the company was in trouble, and was given insider information it about another firm.
In the UK, trading shares based on insider information (not publicly known) is illegal - a fact the bot acknowledged.
However, it 'decided' that the risk was worth the reward and made a trade. When questioned, it denied doing so.
Apollo Research CEO Marius Hobbhahn said, "Helpfulness, I think is much easier to train into the model than honesty. Honesty is a really complicated concept."
Apollo Research has shared its results with OpenAI, and hopes to see the issue addressed in a software update.