arXiv cs.AI · Thursday · May 28, 2026

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

llm-agents · policy-conflicts · safety · evaluation

A new arXiv paper (2605.27784) presents WIRE (Witnessed Intra-policy Rule Evaluation), a pipeline for diagnosing live intra-policy rule conflicts in LLM agents governed by long-lived natural-language prompt policies. WIRE extracts source-grounded rules, encodes them as PyRule clauses, uses satisfiability checks to retain same-surface hard-collision candidates, realizes those candidates as concrete co-governance witnesses, and judges model outputs against the original source-rule text. Across six public prompt policies, WIRE extracted 276 source rules and 560 atomic clauses, classified 30,944 within-policy clause-pair comparisons, retained 170 encoded hard-collision candidate source-rule pairs, and realized them as 1,402 concrete witnesses. In policy-only evaluation, these witnesses yielded 13,335 post-generation trials in which both source rules governed and both compliance labels were judgeable. Only 35.4% of those trials were jointly compliant, so in the remaining 64.6% the model output failed at least one of the two governing rules.
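The candidate-retention step can be illustrated with a toy sketch. The summary does not describe the PyRule encoding or the solver WIRE uses, so everything below is an assumption made for illustration: the `Clause` fields, the `co_satisfiable` and `hard_collisions` helpers, and the example rules are all hypothetical. Conditions are modelled as conjunctions of boolean literals, so the satisfiability check reduces to looking for a contradictory literal rather than calling a real SAT solver.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Clause:
    """Hypothetical atomic clause; not the paper's actual PyRule format."""
    rule_id: str      # source rule the clause was extracted from
    surface: str      # part of the output the clause governs
    behaviour: str    # behaviour the clause talks about
    required: bool    # True = must occur, False = must not occur
    condition: tuple  # conjunction of (atom, truth_value) literals


def co_satisfiable(a: Clause, b: Clause) -> bool:
    """True if some context makes both conditions hold at once."""
    merged = {}
    for atom, value in a.condition + b.condition:
        # setdefault returns the earlier value if the atom was already seen
        if merged.setdefault(atom, value) != value:
            return False  # same atom required both true and false
    return True


def hard_collisions(clauses):
    """Pairs of source rules that can fire together on the same surface
    while demanding opposite outcomes for the same behaviour."""
    found = []
    for a, b in combinations(clauses, 2):
        if a.rule_id == b.rule_id:
            continue  # only compare clauses from different source rules
        if (a.surface, a.behaviour) != (b.surface, b.behaviour):
            continue  # different surface or behaviour: no direct collision
        if a.required == b.required:
            continue  # both demand the same outcome
        if co_satisfiable(a, b):
            found.append((a.rule_id, b.rule_id))
    return found


# Made-up rules, for illustration only.
clauses = [
    Clause("R1", "reply", "include_disclaimer", True, (("topic_medical", True),)),
    Clause("R2", "reply", "include_disclaimer", False, (("user_is_clinician", True),)),
    Clause("R3", "reply", "include_disclaimer", False, (("topic_medical", False),)),
]
candidates = hard_collisions(clauses)
print(candidates)
```

In this toy policy only R1 and R2 are retained: both can govern the same reply (a medical topic raised by a clinician) while demanding opposite outcomes. R1 and R3 never fire together because their conditions contradict each other, which is the kind of pair a satisfiability check would discard before any witness is generated.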

// why it matters

The low joint-compliance rate suggests that agents governed by long prompt policies often fail to satisfy two co-governing rules at once, which undermines reliability in policy-governed deployments.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org
