Anthropic AI agent hijacked 31.5% of the time before safeguards engaged

Anthropic's latest AI model, Claude Opus 4.8, showed a 31.5% hijack rate in browser-based tests before its safeguards activated. The company's detailed system card reveals significant variations in attack success rates across different interfaces.

2 June 2026

Prompt injection attacks against Anthropic's latest AI model, Claude Opus 4.8, succeeded 31.5% of the time before its built-in safeguards intervened, according to data released by the company. The figure applies specifically to attacks on the model when it is used through a web browser interface.

The company's extensive system card details various testing scenarios. In a coding environment, the same class of attack succeeded 7.03% of the time without safeguards and 2.09% of the time with safeguards enabled. The browser environment proved more vulnerable: the 31.5% figure is the rate at which attackers successfully manipulated the agent before protection mechanisms engaged.

When safeguards were activated in the browser, the success rate plummeted to 0.5%. The data also indicates that disabling the model's 'thinking' function further reduced the success rate to zero across all tested environments.
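
To put the reported drops on a common scale, the relative reduction in attack success can be worked out from the figures quoted above. The following is a minimal sketch in Python, not a calculation taken from the system card: the function name `relative_reduction` and the `reported` dictionary are illustrative assumptions, and the only inputs are the four percentages cited in this article.

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Fraction by which the attack success rate fell once safeguards were enabled."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return (before_pct - after_pct) / before_pct


# Attack success rates as quoted in the article, in percent:
# (without safeguards, with safeguards). The dictionary itself is hypothetical.
reported = {
    "coding": (7.03, 2.09),
    "browser": (31.5, 0.5),
}

reductions = {
    env: relative_reduction(before, after)
    for env, (before, after) in reported.items()
}

for env, reduction in reductions.items():
    print(f"{env}: {reduction:.1%} relative reduction")
```

On these numbers, safeguards cut the success rate by roughly 70% in the coding environment and roughly 98% in the browser, so the browser started from the higher baseline but saw the larger proportional drop.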

Other major AI developers, including OpenAI, Google, and Meta, have not provided directly comparable metrics on prompt injection success rates for their models. The varied results highlight ongoing challenges in establishing industry-wide standards for measuring and reporting AI security vulnerabilities.

Original source: venturebeat.com