arXiv cs.AI · Wednesday · May 27, 2026

Can LLMs Introspect? A Reality Check

llm · introspection · metacognition · arxiv

A new paper on arXiv (2605.26242) challenges recent claims that large language models can introspect on their own internal states. Drawing on lessons from human metacognition research, the authors argue that behavioral evidence alone is insufficient to establish genuine introspection, because models may instead rely on surface-level pattern matching. They re-examine two evaluation paradigms. In the first, models are tested on whether they can detect tampering with their internal states. The authors find that models cannot reliably distinguish such interventions from input manipulations, indicating that prior successes likely reflect general anomaly detection rather than specific introspection. In the second, models predict labels derived from their own hidden states; here the authors find that classifiers trained on these predictions may exploit spurious correlations rather than reflect true self-awareness. The paper concludes that strong introspective claims require more rigorous evidence than behavioral tests alone.
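To make the first paradigm's control concrete, here is a minimal Python sketch of how one might score it. It is not the paper's code or protocol: the condition names, the report labels, the two scoring functions, and the toy trial data are all assumptions made up for illustration. The idea is that a high detection rate on tampered trials says little on its own; what matters is whether the model can tell internal tampering apart from input manipulation.

```python
# Hypothetical scoring sketch; names and data are illustrative assumptions,
# not taken from the paper.
# Each trial is (condition, report):
#   condition: "clean", "internal" (hidden-state tampering), or "input" (prompt manipulation)
#   report:    what the model claimed: "none", "internal", or "input"

def detection_rate(trials, condition):
    """Fraction of trials in `condition` where the model reported any anomaly."""
    reports = [report for cond, report in trials if cond == condition]
    if not reports:
        return 0.0
    return sum(report != "none" for report in reports) / len(reports)

def discrimination_accuracy(trials):
    """Among perturbed trials the model flagged, the fraction where the
    reported source matches the true source of the perturbation."""
    flagged = [(cond, report) for cond, report in trials
               if cond != "clean" and report != "none"]
    if not flagged:
        return 0.0
    return sum(cond == report for cond, report in flagged) / len(flagged)

# Made-up toy data: the model notices most perturbations of either kind,
# but names the correct source only about half the time.
toy_trials = (
    [("clean", "none")] * 8 + [("clean", "internal")] * 2
    + [("internal", "internal")] * 5 + [("internal", "input")] * 4
    + [("internal", "none")] * 1
    + [("input", "internal")] * 5 + [("input", "input")] * 4
    + [("input", "none")] * 1
)

print("detection (internal):", detection_rate(toy_trials, "internal"))
print("detection (input):   ", detection_rate(toy_trials, "input"))
print("false alarms (clean):", detection_rate(toy_trials, "clean"))
print("discrimination:      ", discrimination_accuracy(toy_trials))
```

In this toy setup, detection is high for both kinds of perturbation while discrimination sits near chance, which is the pattern one would expect from general anomaly detection rather than specific introspection.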

// why it matters

Developers who rely on LLM introspection for debugging or alignment may need methods that do not rest on behavioral tests alone.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org
