Confidence Calibration in Large Language Models
A new preregistered study on arXiv (2605.23909) examines confidence calibration in large language models. It finds that LLMs, like humans, are overconfident on average: their stated confidence exceeds their actual accuracy. The effect is moderated by task difficulty, however. Overconfidence is most pronounced on hard tests, while easy tests show substantial underconfidence. The authors introduce LifeEval, a benchmark designed to evaluate model calibration across varying difficulty levels. The findings indicate that current LLMs are not well calibrated, which has implications for trust and reliability in deployment.
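To make the notion of over- and underconfidence concrete, here is a minimal Python sketch of one common way to measure it: the gap between mean stated confidence and accuracy, computed separately per difficulty level. It is not the paper's method or the LifeEval scoring procedure. The function name `confidence_gap_by_group`, the record layout, and the sample numbers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def confidence_gap_by_group(records):
    """Return mean confidence minus accuracy for each group.

    `records` is an iterable of (group, confidence, correct) tuples, where
    confidence is a probability in [0, 1] and correct is a bool.
    A positive gap means overconfidence; a negative gap means underconfidence.
    This record layout is an assumption made for this sketch.
    """
    # Per group: [sum of confidences, number correct, number of items]
    totals = defaultdict(lambda: [0.0, 0, 0])
    for group, confidence, correct in records:
        entry = totals[group]
        entry[0] += confidence
        entry[1] += int(correct)
        entry[2] += 1
    return {
        group: conf_sum / n - n_correct / n
        for group, (conf_sum, n_correct, n) in totals.items()
    }


# Made-up illustrative data, not results from the study.
sample = [
    ("hard", 0.9, True),
    ("hard", 0.8, False),
    ("hard", 0.9, False),
    ("hard", 0.8, True),
    ("easy", 0.7, True),
    ("easy", 0.8, True),
    ("easy", 0.7, True),
    ("easy", 0.8, True),
]

gaps = confidence_gap_by_group(sample)
for group, gap in gaps.items():
    print(f"{group}: gap = {gap:+.2f}")
```

On this made-up sample the hard group comes out with a positive gap (overconfident) and the easy group with a negative gap (underconfident), mirroring the pattern the study describes. A single aggregate gap would hide this, because the two effects partly cancel, which is why the sketch groups by difficulty.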
// why it matters
Developers should not take LLM confidence scores at face value: they tend to overstate accuracy on hard tasks and understate it on easy ones.