arXiv cs.AI · Thursday · May 28, 2026

Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

llm · calibration · confidence · arxiv

A paper on arXiv (2605.27752v1) investigates how measurement choices affect comparisons between token-probability scores and verbalized confidence in LLM calibration. The authors hold the verbalized-confidence elicitation fixed (a single prompt template, probability scale, and output format) and vary three measurement axes: which answer string receives the token-probability score, how that score is read from the answer tokens, and the conditioning context. They evaluate on four QA benchmarks across three open 7-8B base/Instruct model families, with larger Qwen2.5 variants as a robustness check. Conditioning context changes the sign or magnitude of the expected calibration error (ECE) gap across settings; token readout produces smaller changes that can still move the sign; changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings show specific patterns. Overall, the study finds that calibration comparisons are not robust to seemingly minor methodological choices.
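To make the "token readout" axis concrete, here is a minimal Python sketch, not the paper's code. It assumes three common ways of collapsing per-token log-probabilities into one score (joint product, length-normalized mean, first token only) and a standard equal-width-bin ECE; the paper's actual readouts and estimators may differ. The names `readout`, `ece`, and `ece_gap` are hypothetical, and the toy numbers are invented purely for illustration.

```python
import math
from typing import Sequence


def readout(token_logprobs: Sequence[float], mode: str = "product") -> float:
    """Collapse an answer's per-token log-probabilities into one score in [0, 1].

    The three modes are illustrative assumptions, not the paper's definitions.
    """
    if not token_logprobs:
        raise ValueError("answer has no tokens")
    if mode == "product":  # joint probability of the whole answer string
        return math.exp(sum(token_logprobs))
    if mode == "mean":  # geometric mean of token probabilities (length-normalized)
        return math.exp(sum(token_logprobs) / len(token_logprobs))
    if mode == "first":  # probability of the first answer token only
        return math.exp(token_logprobs[0])
    raise ValueError(f"unknown readout mode: {mode}")


def ece(confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width confidence bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("confidences and correct must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    total = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            total += len(members) / n * abs(avg_conf - accuracy)
    return total


def ece_gap(token_conf: Sequence[float], verbal_conf: Sequence[float],
            correct: Sequence[int]) -> float:
    """ECE(token-probability) minus ECE(verbalized); negative means token scores look better."""
    return ece(token_conf, correct) - ece(verbal_conf, correct)


# Toy data (invented): four questions, each answer has two tokens with p = 0.9.
answers = [[math.log(0.9), math.log(0.9)] for _ in range(4)]
verbal = [0.85, 0.85, 0.85, 0.85]  # verbalized confidences, held fixed
correct = [1, 1, 1, 0]             # 75% accuracy

gap_product = ece_gap([readout(a, "product") for a in answers], verbal, correct)
gap_mean = ece_gap([readout(a, "mean") for a in answers], verbal, correct)
print(f"gap (product readout): {gap_product:+.3f}")
print(f"gap (mean readout):    {gap_mean:+.3f}")
```

On this toy data the two readouts give gaps of opposite sign even though the model outputs, the verbalized confidences, and the ECE estimator are all identical. That is the kind of protocol sensitivity the summary describes, shown here only on made-up numbers.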

// why it matters

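Developers evaluating LLM confidence calibration should state the measurement protocol explicitly (which answer string is scored, how the token score is read out, and the conditioning context), because the comparison with verbalized confidence can change sign with these choices.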

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
