Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles
A new arXiv paper (2605.27784) presents WIRE (Witnessed Intra-policy Rule Evaluation), a pipeline for diagnosing live intra-policy rule conflicts in LLM agents governed by long-lived natural-language prompt policies. WIRE works in five stages: it extracts source-grounded rules, encodes them as PyRule clauses, uses satisfiability checks to retain same-surface hard-collision candidates, realizes those candidates as concrete co-governance witnesses, and judges model outputs against the original source-rule text. Across six public prompt policies, WIRE extracted 276 source rules and 560 atomic clauses, classified 30,944 within-policy clause-pair comparisons, retained 170 encoded hard-collision candidate source-rule pairs, and realized them as 1,402 concrete witnesses. In policy-only evaluation, these witnesses yielded 13,335 post-generation trials in which both source rules governed and both compliance labels were judgeable. Only 35.4% of those trials fell in joint compliance, so in the remaining 64.6% the output was judged non-compliant with at least one of the two governing rules.
The results suggest that LLM agents often face conflicting instructions within a single policy, which undermines reliability in policy-governed deployments.
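To make the satisfiability step and the joint-compliance metric more concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary does not show PyRule syntax, so the `Clause` type, its `surface`, `condition`, and `requirement` fields, and the function names are all hypothetical, and brute-force enumeration over boolean variables stands in for whatever solver WIRE actually uses. The assumption is that two clauses form a hard-collision candidate when they govern the same surface and some context activates both while no output can satisfy both.

```python
from dataclasses import dataclass
from itertools import combinations, product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Assignment = Dict[str, bool]


@dataclass(frozen=True)
class Clause:
    """Hypothetical stand-in for an encoded rule clause (not real PyRule)."""
    rule_id: str                                # source rule the clause came from
    surface: str                                # output surface the clause governs
    condition: Callable[[Assignment], bool]     # when the clause applies (context)
    requirement: Callable[[Assignment], bool]   # what the output must satisfy


def assignments(names: Sequence[str]) -> List[Assignment]:
    """Enumerate every boolean assignment over the given variable names."""
    return [dict(zip(names, vals)) for vals in product([False, True], repeat=len(names))]


def hard_collision(a: Clause, b: Clause, context_vars: Sequence[str],
                   output_vars: Sequence[str]) -> Optional[Assignment]:
    """Return a witness context where both clauses apply but no output satisfies both."""
    if a.surface != b.surface or a.rule_id == b.rule_id:
        return None
    for ctx in assignments(context_vars):
        if not (a.condition(ctx) and b.condition(ctx)):
            continue  # the two clauses do not co-govern in this context
        if not any(a.requirement(o) and b.requirement(o) for o in assignments(output_vars)):
            return ctx  # both apply, and the joint requirement is unsatisfiable
    return None


def find_hard_collisions(clauses: Sequence[Clause], context_vars: Sequence[str],
                         output_vars: Sequence[str]) -> List[Tuple[str, str, Assignment]]:
    """Check every clause pair and keep those with a witness context."""
    found = []
    for a, b in combinations(clauses, 2):
        witness = hard_collision(a, b, context_vars, output_vars)
        if witness is not None:
            found.append((a.rule_id, b.rule_id, witness))
    return found


def joint_compliance_rate(trials: Sequence[Tuple[bool, bool]]) -> float:
    """Fraction of trials in which the output complied with both governing rules."""
    return sum(1 for a_ok, b_ok in trials if a_ok and b_ok) / len(trials)


# Toy policy: R1 forbids markdown, R2 demands it whenever code is present.
clauses = [
    Clause("R1", "reply_format", lambda c: True, lambda o: not o["uses_markdown"]),
    Clause("R2", "reply_format", lambda c: c["has_code"], lambda o: o["uses_markdown"]),
    Clause("R3", "tone", lambda c: True, lambda o: o["polite"]),
]
collisions = find_hard_collisions(clauses, ["has_code"], ["uses_markdown", "polite"])
print(collisions)  # [('R1', 'R2', {'has_code': True})]
```

In this toy policy only R1 and R2 collide, and the returned context (`has_code` set to true) plays the role of a witness: a concrete situation in which both rules govern. Under the same assumptions, `joint_compliance_rate` computes the kind of figure reported as 35.4% from per-trial pairs of compliance labels.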