Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs
Palette, introduced in arXiv paper 2605.24154, addresses the one-size-fits-all safety alignment problem in LLMs. Current models apply the same refusal policy regardless of user or context, so they also reject legitimate requests from authorized professionals. Palette identifies a refusal direction via multi-objective search and internalizes it through lightweight adaptation, avoiding both costly realignment and inference-time steering. It supports modular composition: domain-specific safety controls are learned independently and then composed via parameter merging, enabling on-demand authorization across multiple domains without retraining. Experiments on four safety benchmarks, multiple model variants, and both LLMs and VLMs show that Palette effectively relaxes refusal in target domains while preserving safety elsewhere. The framework offers precise control with minimal added latency, making it suitable for specialized professional settings.
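The two mechanisms named above, a refusal direction and composition by parameter merging, can be illustrated with a small, self-contained Python sketch. It is not Palette's implementation: the paper's multi-objective direction search and its adaptation step are not detailed here. The sketch therefore substitutes two common stand-ins. It estimates a difference-of-means refusal direction from activations on harmful versus harmless prompts, and it models domain composition as task-arithmetic-style summation of independently learned per-domain parameter deltas. All function names, parameter names, and toy values are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def refusal_direction(harmful: List[Vector], harmless: List[Vector]) -> Vector:
    """Hypothetical stand-in: unit-norm difference of mean activations.

    Palette finds its direction via multi-objective search; this simpler
    difference-of-means estimate is used here only for illustration.
    """
    diff = [a - b for a, b in zip(mean(harmful), mean(harmless))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def refusal_score(activation: Vector, direction: Vector) -> float:
    """Projection of an activation onto the unit refusal direction;
    larger values mean the activation expresses refusal more strongly."""
    return sum(a * d for a, d in zip(activation, direction))


def merge_domains(base: Dict[str, float],
                  domain_deltas: List[Dict[str, float]]) -> Dict[str, float]:
    """Compose independently learned domain controls by adding each
    domain's parameter deltas onto a copy of the base parameters.

    Assumption: every delta key names a parameter present in `base`.
    """
    merged = dict(base)  # leave the base parameters untouched
    for deltas in domain_deltas:
        for name, delta in deltas.items():
            merged[name] += delta
    return merged


if __name__ == "__main__":
    # Toy 3-dimensional activations (hypothetical values).
    harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
    harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
    r = refusal_direction(harmful, harmless)
    print(r)  # [1.0, 0.0, 0.0]
    print(refusal_score([5.0, 2.0, 1.0], r))  # 5.0

    # Authorize two hypothetical domains by merging their deltas.
    base = {"w1": 1.0, "w2": 0.5}
    medical = {"w1": 0.25}
    legal = {"w2": -0.5}
    print(merge_domains(base, [medical, legal]))  # {'w1': 1.25, 'w2': 0.0}
```

Keeping each domain as a separate delta means a different authorization set can be assembled simply by choosing which deltas to add, with no retraining. Whether Palette uses plain summation, weighted merging, or another merging rule is not stated here, so treat the merge step as an assumption.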
Enables controlled, on-demand relaxation of LLM safety refusals for authorized professionals without retraining.