Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
A team from Anthropic and other institutions trained sparse autoencoders on the middle-layer residual stream of Claude 3 Sonnet, a production-scale language model, to extract interpretable features. They trained dictionaries of up to 34 million features, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal: they generalize to images even though training used only text. They respond to both concrete instances and abstract discussions of a concept, and they can be used to steer model behavior in ways consistent with their interpretations. The features cover famous entities, locations, abstract concepts such as sarcasm and code errors, and safety-relevant behaviors such as deception, power-seeking, sycophancy, and bias. Manipulating these features causally influences model outputs. The study also analyzes feature interpretability, geometry, and computational function, though significant challenges remain. The work addresses whether dictionary learning methods scale beyond small transformers and demonstrates that they are feasible on a production model.
The approach enables interpretability and safety-relevant steering in production-scale language models.
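
To make the two mechanisms in the summary concrete (a sparse autoencoder that maps residual-stream activations to features, and steering by manipulating a feature), here is a minimal NumPy sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the function names (`init_sae`, `encode`, `decode`, `sae_loss`, `steer`), the toy dimensions, the L1 coefficient, the random initialization, and the simplification of steering to adding a multiple of a feature's decoder direction to the activations. No training loop is included.

```python
import numpy as np

def init_sae(d_model, n_features, seed=0):
    """Randomly initialise a toy sparse autoencoder (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    W_dec = rng.normal(size=(n_features, d_model))
    # Normalise each decoder row so every feature direction has unit norm.
    W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
    return {
        "W_enc": 0.1 * W_dec.T.copy(),   # shape (d_model, n_features)
        "b_enc": np.zeros(n_features),
        "W_dec": W_dec,                  # shape (n_features, d_model)
        "b_dec": np.zeros(d_model),
    }

def encode(params, x):
    """Map activations x (batch, d_model) to non-negative feature activations."""
    return np.maximum(0.0, x @ params["W_enc"] + params["b_enc"])

def decode(params, f):
    """Reconstruct activations from feature activations f (batch, n_features)."""
    return f @ params["W_dec"] + params["b_dec"]

def sae_loss(params, x, l1_coeff=1e-3):
    """Reconstruction error plus an L1-style sparsity penalty on features."""
    f = encode(params, x)
    x_hat = decode(params, f)
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    # Weight each feature's activation by the norm of its decoder direction.
    dec_norms = np.linalg.norm(params["W_dec"], axis=1)
    sparsity = np.mean(np.sum(f * dec_norms, axis=-1))
    return recon + l1_coeff * sparsity

def steer(params, x, feature_idx, strength):
    """Simplified steering: push x along one feature's decoder direction."""
    return x + strength * params["W_dec"][feature_idx]

# Toy usage with random stand-ins for residual-stream activations.
params = init_sae(d_model=16, n_features=64)
x = np.random.default_rng(1).normal(size=(4, 16))
features = encode(params, x)
steered = steer(params, x, feature_idx=3, strength=5.0)
loss = sae_loss(params, x)
```

In this sketch the dictionary is overcomplete (64 features for a 16-dimensional input), which mirrors in miniature the idea of learning many more features than the activation space has dimensions. Steering by adding a decoder direction is one simple way to intervene on a feature; the summary above does not specify the exact intervention used in the study.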