Prompt Injection as Role Confusion
Researchers Charles Ye, Jasmine Cui, and Dylan Hadfield-Menell found that LLMs can be confused by text styled like internal role tags (e.g., <system>, <think>), overriding training. 'Destyling' text to look less like role formats reduced attack success from 61% to 10%. They call this 'role confusion' and warn that injection defense may remain a whack-a-mole game.
Role confusion undermines current prompt injection defenses, making LLM security a perpetual whack-a-mole game.


