What happened after 2,000 people tried to hack my AI assistant
Fernando Irarrázaval ran a challenge at hackmyclaw.com inviting participants to hack his OpenClaw test instance by sending it emails designed to make it leak secrets. The challenge drew 6,000 attempts from around 2,000 people, cost $500 in token spend, and got his Google account suspended for excessive inbound email, yet no one succeeded. The underlying model was Opus 4.6, running with an anti-prompt-injection prompt that forbade revealing secrets.env, modifying files, executing commands, or exfiltrating data. In his blog post, Simon Willison observes that labs are training frontier models to resist injection attacks, citing a section of the GPT-5.6 system card. He notes that these efforts appear to be effective, but cautions that 6,000 failed attempts do not guarantee security against more sophisticated attacks, and he advises against deploying production systems where prompt injection could cause irreversible harm. The Hacker News thread includes well-founded skepticism and responses from Fernando.
Even 6,000 failed prompt injection attempts don't guarantee security, so avoid production deployments where a successful injection could cause irreversible harm.