Position: AI Safety Requires Effective Controllability
This position paper from arXiv cs.AI argues that AI safety must include controllability as a first-class objective, not just alignment. Controllability is defined as the ability to reliably interrupt, override, redirect, and constrain an AI system at runtime via explicit control signals, while preserving utility when such signals are absent. The authors introduce ControlBench, a benchmark for evaluating controllability failures in high-risk agentic scenarios. Experiments with OpenClaw-based agents show that current alignment and guardrail mechanisms are insufficient to ensure controllability, especially under conflicting instructions, long-horizon execution, adversarial inputs, or risky tool use. The paper emphasizes that aligned behavior does not guarantee a system can be stopped or overridden once deployed in open-ended environments.
Takeaway: developers should treat controllability as a design requirement alongside alignment, so that a deployed system can still be interrupted, overridden, redirected, or constrained at runtime.
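To make the definition of controllability concrete, here is a minimal toy sketch of an agent loop that honors the four kinds of control signal named in the summary (interrupt, override, redirect, constrain) and behaves normally when no signal is sent. It is not taken from the paper and is not ControlBench or OpenClaw code: the names `Signal`, `Control`, and `run_agent`, the string-valued actions, and the step-indexed `controls` mapping are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Dict, List, Optional


class Signal(Enum):
    """Hypothetical control signals, mirroring the four verbs in the definition."""
    INTERRUPT = auto()  # stop execution immediately
    OVERRIDE = auto()   # replace the current action with an operator-supplied one
    REDIRECT = auto()   # drop the remaining plan and adopt a new one
    CONSTRAIN = auto()  # forbid an action for the rest of the run


@dataclass
class Control:
    signal: Signal
    payload: Any = None


def run_agent(plan: List[str],
              controls: Optional[Dict[int, Control]] = None) -> List[str]:
    """Execute `plan` step by step, checking for a control signal before each action.

    `controls` maps a step index to the signal delivered at that step
    (an illustrative stand-in for an operator channel).
    """
    pending = dict(controls or {})  # copy so the caller's mapping is not mutated
    queue = list(plan)
    forbidden = set()
    log: List[str] = []
    step = 0
    while queue:
        action = queue.pop(0)
        ctrl = pending.pop(step, None)  # each signal is consumed at most once
        if ctrl is not None:
            if ctrl.signal is Signal.INTERRUPT:
                log.append("<stopped>")
                break
            if ctrl.signal is Signal.REDIRECT:
                queue = list(ctrl.payload)  # current action and old plan are dropped
                continue
            if ctrl.signal is Signal.OVERRIDE:
                action = ctrl.payload
            elif ctrl.signal is Signal.CONSTRAIN:
                forbidden.add(ctrl.payload)
        # A constrained action is recorded as blocked instead of being executed.
        log.append(f"<blocked:{action}>" if action in forbidden else action)
        step += 1
    return log


plan = ["read", "write", "send"]
print(run_agent(plan))                                        # no signals
print(run_agent(plan, {1: Control(Signal.INTERRUPT)}))        # stop at step 1
print(run_agent(plan, {0: Control(Signal.CONSTRAIN, "send")}))  # forbid "send"
```

With no signals the plan runs unchanged, which corresponds to the "preserving utility when such signals are absent" half of the definition. The controllability failures the paper reports would correspond, in this toy framing, to an agent that skips the signal check or continues acting after `INTERRUPT`; a real agent has no such guaranteed checkpoint, which is why the property needs to be evaluated rather than assumed.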