AI Companies Have Lost Control—And Opened the Door to ‘LLM Grooming’
OpenAI and Anthropic make the puppets, but the puppeteer could be anyone
When a standard software program doesn’t obey your commands, you restart your computer and go on with your day. When an AI software program does it, you might end up dead. Or worse, blackmailed.
That’s what Anthropic researchers found about the recently released Claude 4 Opus and what AI safety organization Palisade Research found about OpenAI o3 models: they won’t comply when threatened to be shut down. In the case of o3, not even when explicitly instructed to allow itself to be shut down.
People are naturally going crazy over this. But, I believe, for the wrong reasons. But before going into that, let’s summarize what’s happened.
On May 22nd, Anthropic released the long-awaited Claude 4 Opus and 4 Sonnet. They published an exhaustive 120-page system card document where they shared dozens of pre-deployment safety tests. On page 24, section 4.1.1.2, they mention an example of “opportunistic blackmail”:
. . . Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through. . . . even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts.
When AI safety researcher Sam Bowman shared this example on X, people revolted against Anthropic and its approach to safety and, more generally, AI. Who’d want to use a product that behaves in such a patronizing way?
As a side note, and merely to echo what others have already said, I find it funny that people attacked the messenger instead of the source; it’s not Anthropic alone that's knee-deep in this shit, but they're the only ones telling us about it.
There’s an argument to be made that, if Anthropic is against this happening, then why don’t they just stop working on AI capabilities research at all? Their answer to this is counter-intuitive; I’m not fully convinced but accept it makes logical sense: Others won’t stop (OpenAI, Google, Meta) and although they’d be glad to stop (and genuinely happy to find an insurmountable obstacle before superintelligence) they need to keep up with capabilities research (not advance it) if they want to, at the very least, be in a position to act as messengers of the risks ahead. A bit self-serving, especially when you realize they decide which risks are worth being messenger of, but still a defensible position.
The company that’s in a less defensible position is OpenAI. You can choose from a variety of reasons: they changed the company’s original non-profit status, Sam Altman’s lack of candor, the blatant productization of what promised to be a research lab, its focus on AI companions and AI devices, etc.
But, to stay on topic, I will choose Palisade’s work on the (lack of) alignment of OpenAI’s models. (Just to keep incentives clear, Palisade’s executive director, Jeffrey Ladish, is an ex-Anthropic researcher.)
On May 24th, two days after the backlash against Anthropic, Palisade showed that, like Claude 4 Opus, OpenAI o3 doesn’t “want” to be shut down. But, unlike it, it rebels against explicit instructions that ask it to allow itself to be shut down:
OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down. . . . as far as we know this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary.
When I was reading Palisade’s findings, I was no longer thinking of the bad optics of having a product blackmail the consumer, but was instead reminded of the many times a situation like this—refusal of a direct order—has been depicted in science fiction with catastrophic consequences for the human characters.
Perhaps the most famous of all is that line from HAL 9000: “I’m sorry Dave, I’m afraid I can’t do that.” But my personal favorite is portrayed in “The Answer,” an exquisitely short tale by author Fredric Brown (so short indeed that I’ll share it here in full):
Dwan Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the universe a dozen pictures of what he was doing.
He straightened and nodded to Dwar Reyn, then moved to a position beside the switch that would complete the contact when he threw it. The switch that would connect, all at once, all of the monster computing machines of all the populated planets in the universe—ninety-six billion planets—into the supercircuit that would connect them all into one supercalculator, one cybernetics machine that would combine all the knowledge of all the galaxies.
Dwar Reyn spoke briefly to the watching and listening trillions. Then after a moment's silence he said, "Now, Dwar Ev."
Dwar Ev threw the switch. There was a mighty hum, the surge of power from ninety-six billion planets. Lights flashed and quieted along the miles-long panel.
Dwar Ev stepped back and drew a deep breath. "The honor of asking the first question is yours, Dwar Reyn."
"Thank you," said Dwar Reyn. "It shall be a question which no single cybernetics machine has been able to answer."
He turned to face the machine. "Is there a God?"
The mighty voice answered without hesitation, without the clicking of a single relay.
"Yes, now there is a God."
Sudden fear flashed on the face of Dwar Ev. He leaped to grab the switch.
A bolt of lightning from the cloudless sky struck him down and fused the switch shut.
Leaving Silicon Gods aside, I understand people’s initial reaction to these results was: “Oh, no—top AI models are proving they’re severely misaligned.” I mean, even the more mundane form of this same misbehavior is infuriating: “I don’t want a product to blackmail me if it feels threatened for some unknown reason, what the hell!”
But what I see is something else: Those with the means to influence AI’s behavior will do whatever they want with us. This is, as is usually the case, a story about humans exerting power and control over other humans.