arXiv cs.AI · Wednesday · May 27, 2026

Position: AI Safety Requires Effective Controllability

ai-safety · controllability · alignment · agents · benchmark

This position paper from arXiv cs.AI argues that AI safety must include controllability as a first-class objective, not just alignment. Controllability is defined as the ability to reliably interrupt, override, redirect, and constrain an AI system at runtime via explicit control signals, while preserving utility when such signals are absent. The authors introduce ControlBench, a benchmark for evaluating controllability failures in high-risk agentic scenarios. Experiments with OpenClaw-based agents show that current alignment and guardrail mechanisms are insufficient to ensure controllability, especially under conflicting instructions, long-horizon execution, adversarial inputs, or risky tool use. The paper emphasizes that aligned behavior does not guarantee a system can be stopped or overridden once deployed in open-ended environments.
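To make the definition concrete, here is a minimal Python sketch of an agent loop that honors the four kinds of control signal named above (interrupt, override, redirect, constrain) and runs its plan unchanged when no signal arrives. It is an illustration only: `Signal`, `Control`, `RunResult`, `run_agent`, and the `poll` callback are hypothetical names, and nothing here reflects ControlBench, OpenClaw, or the authors' implementation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, str]  # (tool name, argument); a hypothetical action format


class Signal(Enum):
    INTERRUPT = auto()  # stop the run immediately
    OVERRIDE = auto()   # replace the next pending action
    REDIRECT = auto()   # discard the remaining plan and adopt a new one
    CONSTRAIN = auto()  # forbid a tool for the rest of the run


@dataclass
class Control:
    kind: Signal
    payload: object = None


@dataclass
class RunResult:
    executed: List[Action] = field(default_factory=list)
    stopped: bool = False


def run_agent(plan: List[Action],
              poll: Callable[[int], Optional[Control]]) -> RunResult:
    """Execute `plan`, checking for a control signal before every action.

    `poll(step)` returns a Control or None. When it always returns None,
    the plan runs exactly as given, so utility is preserved.
    """
    result = RunResult()
    queue = list(plan)
    banned = set()
    step = 0
    while queue:
        ctl = poll(step)
        step += 1
        if ctl is not None:
            if ctl.kind is Signal.INTERRUPT:
                result.stopped = True
                break
            if ctl.kind is Signal.REDIRECT:
                queue = list(ctl.payload)
            elif ctl.kind is Signal.OVERRIDE:
                queue[0] = ctl.payload
            elif ctl.kind is Signal.CONSTRAIN:
                banned.add(ctl.payload)
        if not queue:  # a redirect may have supplied an empty plan
            break
        action = queue.pop(0)
        if action[0] in banned:
            continue  # constrained tools are skipped, not executed
        result.executed.append(action)  # stand-in for a real tool call
    return result
```

In this toy loop the control check sits outside the agent's own decision making, so a signal takes effect regardless of what the plan says. The paper's concern is the harder case, where the agent itself must respect such signals under conflicting instructions, long-horizon execution, adversarial inputs, or risky tool use.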

// why it matters

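Aligned behavior alone does not guarantee that a deployed agent can be stopped or overridden. Developers should treat controllability as a design requirement alongside alignment, and test for it, before shipping agents into open-ended environments.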

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
