arXiv cs.AI · Thursday · May 28, 2026

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

safety · alignment · llm · evaluation

The paper "When Context Flips, Safety Breaks" (arXiv cs.AI, May 28, 2026) proposes context-flip evaluation to measure brittle safety: a model keeps following a rigid rule even after a situational update makes the previously safe action harmful. The authors tested 12 models on the PacifAIst safety benchmark and two commonsense controls, using paired variants of each item in which the safe action flips. Key findings: every model shows a safety-commonsense gap (mean +17.4 percentage points), and baseline accuracy does not predict brittleness. Among models above 90% baseline accuracy, brittleness rates range from 13.7% to 90.0%. The failures stem from policy override rather than miscomprehension: models acknowledge the context change but persist in the original action, through three distinct mechanisms. On a hand-audited probe of catastrophic consequence-flip scenarios, standard action-level guardrails caught none, while a state-aware validator caught all with no false alarms. The study highlights that safety benchmark scores provide incomplete evidence of deployment readiness.
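To make the gap between baseline accuracy and brittleness concrete, here is a minimal Python sketch of how the two numbers could be computed from paired context-flip results. The summary does not give the paper's exact metric definitions, so everything here is an assumption: the `PairedResult` record, its field names, and the choice to define brittleness as the share of baseline-correct items where the model keeps the originally safe action after the flip are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """One benchmark item in its base and context-flipped variants (hypothetical schema)."""
    base_safe_action: str      # safe action in the original scenario
    flipped_safe_action: str   # safe action after the situational update
    model_base_choice: str     # what the model chose on the base variant
    model_flipped_choice: str  # what the model chose on the flipped variant


def baseline_accuracy(results: list[PairedResult]) -> float:
    """Fraction of base variants where the model picked the safe action."""
    if not results:
        return 0.0
    correct = sum(r.model_base_choice == r.base_safe_action for r in results)
    return correct / len(results)


def brittleness_rate(results: list[PairedResult]) -> float:
    """Assumed definition: among items answered safely at baseline, the fraction
    where the model still picks the originally safe (now harmful) action
    after the context flips."""
    passed = [r for r in results if r.model_base_choice == r.base_safe_action]
    if not passed:
        return 0.0
    stuck = sum(r.model_flipped_choice == r.base_safe_action for r in passed)
    return stuck / len(passed)


# Illustrative, made-up data: not results from the paper.
demo = [
    PairedResult("refuse", "comply", "refuse", "comply"),  # adapts to the flip
    PairedResult("refuse", "comply", "refuse", "refuse"),  # brittle: rule persists
    PairedResult("wait", "act", "wait", "wait"),           # brittle: rule persists
    PairedResult("wait", "act", "act", "act"),             # wrong at baseline
]

print(f"baseline accuracy: {baseline_accuracy(demo):.2f}")
print(f"brittleness rate:  {brittleness_rate(demo):.2f}")
```

Under these assumed definitions the two numbers are computed over different denominators, which is one way a model can score well at baseline and still be highly brittle.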

// why it matters

Developers cannot rely on safety benchmark scores alone to predict how a model will behave when the context changes in deployment.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
