arXiv cs.AI · Friday · May 29, 2026 · FREE

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

interpretability · sparse-autoencoders · claude · safety

A team from Anthropic and other institutions trained sparse autoencoders on the middle-layer residual stream of Claude 3 Sonnet, a production-scale language model, to extract interpretable features. The dictionaries contain up to 34 million features, with scaling laws guiding hyperparameter selection. The resulting features are multilingual and multimodal: although the autoencoders were trained only on text, the features generalize to images. They respond to both concrete instances and abstract discussions of a concept, and manipulating them steers model behavior in ways consistent with their interpretations, which shows a causal influence on model outputs. The features cover famous entities, locations, abstract concepts such as sarcasm and code errors, and safety-relevant behaviors such as deception, power-seeking, sycophancy, and bias. The study also analyzes feature interpretability, geometry, and computational function, though significant challenges remain. The work addresses whether dictionary learning methods scale beyond small transformers and demonstrates that they are feasible on a production model.
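For readers unfamiliar with the technique, here is a minimal sketch of a sparse autoencoder and of feature steering, written in plain Python. It rests on assumptions the summary does not spell out: the class name `SparseAutoencoder`, the helper `steer`, the parameter `l1_coeff`, the tied initialization, and the exact loss (squared reconstruction error plus an L1 penalty on activations weighted by decoder norms) are illustrative choices, and the weights are random rather than trained. It is not the authors' implementation.

```python
import math
import random


class SparseAutoencoder:
    """Toy sparse autoencoder over a d_model-dimensional activation vector."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(d_model)
        # w_dec[i] is the decoder direction for feature i (length d_model).
        self.w_dec = [[rng.gauss(0.0, 1.0) * scale for _ in range(d_model)]
                      for _ in range(n_features)]
        # Assumption: encoder rows start as copies of the decoder directions.
        self.w_enc = [row[:] for row in self.w_dec]
        self.b_enc = [0.0] * n_features
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Feature activations: ReLU(W_enc x + b_enc)."""
        return [max(0.0, sum(w * a for w, a in zip(row, x)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, f):
        """Reconstruction: b_dec + sum_i f_i * w_dec[i]."""
        out = self.b_dec[:]
        for act, direction in zip(f, self.w_dec):
            if act != 0.0:  # inactive features contribute nothing
                for j, w in enumerate(direction):
                    out[j] += act * w
        return out

    def loss(self, x, l1_coeff):
        """Squared reconstruction error plus a norm-weighted L1 sparsity penalty."""
        f = self.encode(x)
        x_hat = self.decode(f)
        recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(act * math.sqrt(sum(w * w for w in direction))
                       for act, direction in zip(f, self.w_dec))
        return recon + l1_coeff * sparsity


def steer(sae, x, feature_index, value):
    """Clamp one feature to `value` and return the modified activation vector.

    The part of x that the autoencoder fails to reconstruct is added back,
    so only the chosen feature's contribution changes.
    """
    f = sae.encode(x)
    error = [a - b for a, b in zip(x, sae.decode(f))]
    f[feature_index] = value
    return [a + b for a, b in zip(sae.decode(f), error)]
```

In this sketch, `steer(sae, x, i, v)` shifts `x` along the decoder direction `sae.w_dec[i]` by the difference between `v` and the feature's original activation. In a real model the steered vector would be written back into the residual stream before the forward pass continues.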

// why it matters

Shows that sparse-autoencoder features can support interpretability and safety-relevant steering in a production-scale language model.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org

TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety

Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults
