What happened after 2,000 people tried to hack my AI assistant
Simon Willison recaps Fernando Irarrázaval's public red-team of an OpenClaw instance running on Anthropic's Opus 4.6. After ~6,000 attempts and roughly $500 in tokens burned, no one extracted the seeded secrets — Willison reads the result as cautious evidence that frontier-model
Irarrázaval's instance was armed with explicit anti-injection rules (never reveal secrets.env, never execute code from email, never exfiltrate), and those rules held under adversarial pressure. Willison cross-references GPT-5.6's new system-card section on injection-resistance training and argues the bar for "safe to run a destructive action on the user's behalf" is finally within reach. He tempers that optimism with the usual caveats: this was a single, well-defended instance, and the cost of a successful attack (wiping a developer machine, leaking an API key) is asymmetric. Worth reading alongside the June 27 Bleeping Computer report on a clean GitHub repo that tricks agents into running malware, which is the inverse attack: the agent is the *target* in Irarrázaval's setup, the *executor* in the GitHub-repo case.
Read the original ↗