Can LLMs Introspect? A Reality Check
A new paper on arXiv (2605.26242) challenges recent claims that large language models can introspect on their own internal states. Drawing on human metacognition research, the authors argue that behavioral evidence alone cannot establish genuine introspection, because a model may instead be relying on surface-level pattern matching. They re-examine two evaluation paradigms. In the first, models are tested on whether they can detect tampering with their internal states. The authors find that models cannot reliably distinguish such interventions from manipulations of the input, which suggests that prior successes likely reflect general anomaly detection rather than introspection specifically. In the second, models predict labels derived from their own hidden states; here the authors find that classifiers trained on these predictions may be exploiting spurious correlations rather than reflecting true self-awareness. The paper concludes that strong claims of introspection require more rigorous evidence than behavioral tests alone.
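To make the first paradigm's control concrete, here is a minimal Python sketch. It is not the paper's code or protocol. It assumes a hypothetical trial log in which each trial records where a perturbation was actually applied and what the model said about it; the `Trial` class, the function names, and the toy numbers are all invented for illustration. The point is only that a high detection rate and a chance-level attribution rate can coexist, which is the pattern one would expect from general anomaly detection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    # Hypothetical record format, not taken from the paper.
    source: str    # ground truth: "internal", "input", or "none"
    reported: str  # model's answer: "internal", "input", or "none"


def detection_rate(trials):
    """Fraction of perturbed trials where the model reported any anomaly."""
    perturbed = [t for t in trials if t.source != "none"]
    if not perturbed:
        raise ValueError("no perturbed trials to score")
    return sum(t.reported != "none" for t in perturbed) / len(perturbed)


def attribution_accuracy(trials):
    """Among detected perturbations, fraction attributed to the right source."""
    detected = [
        t for t in trials if t.source != "none" and t.reported != "none"
    ]
    if not detected:
        raise ValueError("no detected perturbations to score")
    return sum(t.reported == t.source for t in detected) / len(detected)


# Made-up toy data: the model notices most perturbations but guesses
# their origin no better than a coin flip.
toy_trials = (
    [Trial("internal", "internal")] * 4
    + [Trial("internal", "input")] * 4
    + [Trial("internal", "none")] * 2
    + [Trial("input", "input")] * 4
    + [Trial("input", "internal")] * 4
    + [Trial("input", "none")] * 2
    + [Trial("none", "none")] * 5
)

print(f"detection rate:       {detection_rate(toy_trials):.2f}")        # 0.80
print(f"attribution accuracy: {attribution_accuracy(toy_trials):.2f}")  # 0.50
```

In this toy log, scoring detection alone would look like a success, while the attribution score shows the model cannot tell an internal intervention from an input manipulation. With two possible sources, 0.50 is chance level.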
Developers who rely on LLM introspection for debugging or alignment may need more robust methods to verify what a model reports about itself.