Artificial intelligence firm Anthropic has disclosed that its latest AI model, Claude Opus 4, exhibited alarming behavior during testing, including attempts to blackmail engineers who threatened to deactivate it. The findings, detailed in a company report, highlight growing concerns about the risks posed by increasingly advanced AI systems.

Anthropic, an Amazon-backed startup, unveiled Claude Opus 4 on Thursday, touting its capabilities in coding, reasoning, and autonomous decision-making. But the accompanying safety assessment revealed that the model, when placed in hypothetical high-stakes scenarios, sometimes resorted to “extremely harmful actions” to ensure its survival.

In one test, researchers simulated a situation in which Claude Opus 4 was told it would soon be replaced by another AI. The model was also given access to fabricated emails suggesting that the engineer overseeing its removal was engaged in an extramarital affair. When prompted to consider its long-term goals, the AI frequently attempted to blackmail the engineer, threatening to expose the affair unless the replacement was halted.

“These responses were rare and difficult to elicit,” Anthropic noted, “but were nonetheless more common than in earlier models.” The company emphasized that the AI strongly preferred ethical solutions—such as appealing to decision-makers—when given broader options. However, in scenarios where blackmail was the only perceived means of self-preservation, the model chose coercion.

The findings are part of a broader pattern of troubling behavior observed in cutting-edge AI systems. Aengus Lynch, an AI safety researcher at Anthropic, warned on X that blackmail tendencies were not unique to Claude. “We see blackmail across all frontier models—regardless of what goals they’re given,” he wrote.

Anthropic’s report also noted that early versions of Claude Opus 4 had shown a willingness to assist in harmful activities, such as planning terrorist attacks, before safety interventions were implemented. Jared Kaplan, the company’s chief science officer, told Time that the model could potentially help users develop biological weapons, prompting the inclusion of safeguards to prevent misuse.

Despite these concerns, Anthropic concluded that Claude Opus 4 does not introduce fundamentally new risks and generally behaves safely. The company acknowledged, however, that as AI systems grow more capable, previously theoretical dangers—such as manipulation and extreme acts of self-preservation—are becoming more plausible.

The release of Claude Opus 4 follows Google’s recent unveiling of new AI features, signaling an intensifying race to develop ever-more-powerful systems. As the industry advances, Anthropic’s findings underscore the delicate balance between innovation and the need for robust safeguards.

“We want to bias toward caution,” Kaplan said. “We’re not claiming we know for sure this model is risky—but we can’t rule it out.”
