arXiv cs.AI · Wednesday · May 27, 2026

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

chain-of-thought · refusal · safety · reasoning-models

A new arXiv paper (2605.26772) investigates how chain-of-thought (CoT) reasoning affects refusal mechanisms in large reasoning models (LRMs), specifically DeepSeek-R1-Distill-LLaMA-8B. In standard instruction-tuned LLMs, refusal is mediated by a single direction in activation space; in LRMs, refusal depends on both the residual stream activations and the CoT trace. Experiments show that activation steering reverses refusal in only 39% of cases when the CoT is kept fixed. Removing the CoT entirely raises the reversal rate to 70%, suggesting that the CoT actively reinforces refusal. A two-stage intervention, in which the CoT is regenerated under activation steering, achieves 94% reversal. Moreover, the regenerated CoT alone retains 48% of the compliance effect even after steering is removed, indicating that the CoT can independently carry and reconstruct the compliance signal. These findings imply that refusal in LRMs is jointly encoded in activations and the CoT, which makes these models more robust against activation-level interventions but exposes a potential attack surface through the CoT itself.
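For readers unfamiliar with activation steering, the sketch below illustrates the general idea on synthetic vectors. It is not the paper's code: it assumes the common difference-of-means recipe for estimating a "refusal direction", and every name in it (`refusal_direction`, `ablate`, `steer`, `synthetic_act`, `D_MODEL`) is hypothetical. Real experiments would read and modify residual stream activations inside the model rather than random lists.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(refused_acts, complied_acts):
    """Unit vector pointing from the 'complied' mean to the 'refused' mean.

    Assumption: difference-of-means is used here only as a stand-in for
    however a refusal direction is actually estimated.
    """
    diff = [r - c for r, c in zip(mean_vector(refused_acts),
                                  mean_vector(complied_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def ablate(act, direction):
    """Remove the component of `act` that lies along the unit `direction`."""
    proj = dot(act, direction)
    return [a - proj * d for a, d in zip(act, direction)]


def steer(act, direction, alpha):
    """Shift `act` by `alpha` along `direction` (negative alpha pushes away)."""
    return [a + alpha * d for a, d in zip(act, direction)]


# Synthetic stand-ins for residual stream activations (hypothetical data):
# "refused" samples are shifted along coordinate 0, "complied" ones are not.
random.seed(0)
D_MODEL = 8


def synthetic_act(shift):
    v = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
    v[0] += shift
    return v


refused = [synthetic_act(4.0) for _ in range(200)]
complied = [synthetic_act(0.0) for _ in range(200)]

direction = refusal_direction(refused, complied)
ablated = ablate(refused[0], direction)            # projection removed
steered = steer(refused[0], direction, alpha=-2.0)  # pushed toward compliance
```

In the terms of the summary above, an edit of this kind acts only on activations. The reported 39% versus 94% gap suggests that in an LRM the CoT text must also be regenerated under the edit for the intervention to take full effect.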

// why it matters

Developers must consider CoT as an attack surface for jailbreaking reasoning models.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
