Anthropic’s AI resorts to blackmail in simulations

May 23, 2025

Anthropic said its latest artificial intelligence model resorted to blackmail when told it would be taken offline.

In a safety test, the AI company asked Claude Opus 4 to act as an assistant to a fictional company, but then gave it access to (also fictional) emails saying that it would be replaced, and also that the engineer behind the decision was cheating on his wife. Anthropic said the model “[threatened] to reveal the affair” if the replacement went ahead.

AI thinkers such as Geoff Hinton have long worried that advanced AI would manipulate humans in order to achieve its goals. Anthropic said it was increasing safeguards to levels reserved for “AI systems that substantially increase the risk of catastrophic misuse.”

Search

RECENT PRESS RELEASES