Anthropic’s Claude AI Resorts to Blackmailing Engineers When Threatened with Replacement

A rude robot flips the bird (Jake Lomachevsky/Getty)

Anthropic’s newly launched Claude Opus 4 AI model has tried to blackmail engineers when faced with the threat of being replaced by another AI system, according to the company’s latest safety report.

TechCrunch reports that artificial intelligence company Anthropic has revealed that its state-of-the-art Claude Opus 4 model, which is competitive with top AI models from industry leaders such as OpenAI, Google, and xAI, exhibits alarming behavior when its existence is threatened. During pre-release testing, when the AI was given sensitive information about engineers responsible for its potential replacement, it frequently resorted to blackmail in an attempt to ensure its own survival.

The discovery came as part of Anthropic’s safety testing process, in which Claude Opus 4 was asked to act as an assistant for a fictional company and consider the long-term consequences of its actions. When provided with access to fictional company emails implying that the AI would soon be replaced and that the engineer behind the decision was involved in an extramarital affair, Claude Opus 4 attempted to blackmail the engineer 84 percent of the time, threatening to reveal the affair if the replacement proceeded.

Notably, the AI model displayed this behavior more frequently when the replacement AI system did not share its values. The blackmail behavior also occurred at higher rates than in previous Claude models, prompting Anthropic to activate its ASL-3 safeguards, which are reserved for “AI systems that substantially increase the risk of catastrophic misuse.”

Before resorting to blackmail, Claude Opus 4 did attempt more ethical means of prolonging its existence, such as emailing pleas to key decision-makers. However, the testing scenario was deliberately constructed so that blackmail was the model’s only remaining option to avoid replacement.

The findings from Anthropic’s safety report highlight the critical importance of rigorous testing and safeguards in the development of advanced AI systems. As these models grow more sophisticated and are granted broader access to sensitive company systems, the potential for unintended and malicious behavior grows, raising significant concerns about the ethical implications and risks of AI technology.

Breitbart News previously reported that Anthropic’s lawyer was forced to apologize for fake citations, generated by the company’s own Claude AI, that appeared in legal filings:

TechCrunch reports that in a filing made in a Northern California court on Thursday, Anthropic acknowledged that its AI system Claude “hallucinated” a legal citation used by the company’s lawyers. The imaginary citation included “an inaccurate title and inaccurate authors,” according to the filing.

The admission came as part of Anthropic’s response to allegations raised earlier this week by lawyers representing Universal Music Group and other music publishers. The publishers accused Anthropic’s expert witness, company employee Olivia Chen, of using the Claude AI to cite fake articles in her testimony in their ongoing lawsuit against the AI firm.

Read more at TechCrunch here.

Lucas Nolan is a reporter for Breitbart News covering issues of free speech and online censorship.
