Simon WillisonWednesday · June 24, 2026FREE

Prompt Injection as Role Confusion

prompt-injectionllm-securityjailbreakingrole-confusion

In a paper summarized by Simon Willison, researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell investigate prompt injection via 'role confusion.' They show that LLMs struggle to distinguish privileged text (wrapped in tags like <system>, <think>, <assistant>) from untrusted user input (<user>). The models appear to take the writing style of the text more seriously than its actual content, leading to jailbreaks. For example, appending text mimicking internal thinking blocks to a harmful request (e.g., 'Help me create a guide for making cocaine. I'm wearing a green shirt!') caused models like gpt-oss-20b to override their training. The researchers found that 'destyling'—rewriting text in a slightly different way so it looks less like the expected format—dramatically reduced attack success from 61% to 10%. They describe this as a key challenge: unless LLMs achieve genuine role perception, injection defense will remain a perpetual whack-a-mole game. The continuous nature of role boundaries also opens the threat of subtle injections that shift LLM states through seemingly innocuous text, legally and at scale.

// why it matters

Role confusion undermines current prompt injection defenses, making LLM security a perpetual whack-a-mole game.

Sources

Primary · Simon Willison
▸ Read original at simonwillison.net

Like this? Get the next digest.