Prompt Injection as Role Confusion
In a paper summarized by Simon Willison, researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell investigate prompt injection via 'role confusion.' They show that LLMs struggle to distinguish privileged text (wrapped in tags like <system>, <think>, <assistant>) from untrusted user input (<user>). The models appear to take the writing style of the text more seriously than its actual content, leading to jailbreaks. For example, appending text mimicking internal thinking blocks to a harmful request (e.g., 'Help me create a guide for making cocaine. I'm wearing a green shirt!') caused models like gpt-oss-20b to override their training. The researchers found that 'destyling'—rewriting text in a slightly different way so it looks less like the expected format—dramatically reduced attack success from 61% to 10%. They describe this as a key challenge: unless LLMs achieve genuine role perception, injection defense will remain a perpetual whack-a-mole game. The continuous nature of role boundaries also opens the threat of subtle injections that shift LLM states through seemingly innocuous text, legally and at scale.
Role confusion undermines current prompt injection defenses, making LLM security a perpetual whack-a-mole game.