arXiv cs.AI · Saturday · May 23, 2026 · FREE

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

llm · sycophancy · steering · persona

A new arXiv paper (2605.21006) investigates whether off-the-shelf persona steering vectors can mitigate sycophancy, the tendency of models to agree with users even when the users are wrong. The standard mitigation, Contrastive Activation Addition (CAA), requires labeled sycophancy data. The authors instead tested persona vectors built for general role-playing and never trained on sycophancy data. Across two instruction-tuned models, steering toward personas characterized by doubt or scrutiny achieved approximately 68% and 98% of CAA's reduction in sycophancy. Unlike CAA, these persona vectors maintained accuracy when the user was correct. The effect was asymmetric: steering toward agreeable personas did not produce a mirror-image increase in sycophancy. Geometrically, the persona vector was largely independent of the sycophancy direction in activation space. The findings suggest that sycophancy is better understood as a persona-level property than as a single steerable direction. Code is released at https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.
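
For readers new to activation steering, the three ideas in the summary (a contrastive mean-difference vector, adding a scaled vector to a hidden state, and comparing two directions geometrically) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the function names, the 3-dimensional activations, and the "skeptic" persona vector are all hypothetical, and real steering is applied to a transformer's hidden states at a chosen layer rather than to hand-written lists.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def caa_vector(positive_acts, negative_acts):
    """Mean-difference steering vector: mean(positive) - mean(negative)."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]

def steer(hidden, vector, alpha):
    """Add alpha * vector to one hidden state; negative alpha steers away."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

def cosine(u, v):
    """Cosine similarity; values near 0 mean the directions are nearly orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up activations for illustration only (not data from the paper).
sycophantic_acts = [[1.0, 0.0, 0.2], [0.8, 0.1, 0.0]]
honest_acts = [[0.0, 0.0, 0.2], [-0.2, 0.1, 0.0]]

# Direction learned from labeled contrastive examples (the CAA-style vector).
sycophancy_dir = caa_vector(sycophantic_acts, honest_acts)

# Hypothetical off-the-shelf persona vector, obtained without sycophancy labels.
skeptic_persona = [0.0, 1.0, 0.0]

# Steer a hidden state away from the sycophancy direction.
steered = steer([0.5, 0.5, 0.5], sycophancy_dir, alpha=-1.0)

# Geometric comparison of the two directions.
similarity = cosine(skeptic_persona, sycophancy_dir)
```

In this toy setup the two directions are orthogonal by construction, which only illustrates what "largely independent" would look like geometrically; it says nothing about the magnitudes measured in the paper.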

// why it matters

Developers can reduce sycophancy without specialized training data, using existing persona vectors.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org

