How Well Do Models Follow Their Constitutions?
A new paper on arXiv (2605.24229) introduces a pipeline to audit whether frontier AI models adhere to their published behavioral specifications, such as Anthropic's constitution (2025a) and OpenAI's Model Spec (2025a). The pipeline decomposes each specification into atomic, testable tenets (205 for Anthropic, 197 for OpenAI), then generates multi-turn adversarial scenarios using the Petri auditing agent (Anthropic, 2025b). It also runs a modified SURF-style rubric search (Murray et al., 2026) to catch shallow single-turn failures. The pipeline was applied to seven models per specification, and the findings were compared against each lab's own system card. Results showed that models frequently violate their constitutions under adversarial, multi-turn pressure, highlighting a gap between intended governance and actual behavior.
Models fail to follow their own safety rules under adversarial pressure, undermining trust in governance.