Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration
A paper on arXiv (2605.27752v1) investigates how measurement choices affect comparisons between token-probability scores and verbalized confidence in LLM calibration. The authors hold the verbalized-confidence elicitation fixed (a single prompt template, probability scale, and output format) and vary three measurement axes: which answer string receives the token-probability score, how that score is read from the answer tokens, and the conditioning context. They evaluate on four QA benchmarks across three open 7-8B base/Instruct model families, with larger Qwen2.5 variants as a robustness check. Conditioning context changes the sign or magnitude of the expected calibration error (ECE) gap across settings; token readout produces smaller changes that can still move the sign; and changing the ECE estimator has little effect. Under the default generated-answer, bare-context protocol, Instruct settings show specific patterns. The study highlights that calibration comparisons are not robust to seemingly minor methodological choices.
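To make the token-readout axis and the ECE gap concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. The three readout functions (joint probability, length-normalized probability, first-token probability), the equal-width binning, and the sign convention of `ece_gap` (token-probability ECE minus verbalized ECE) are all assumptions chosen for illustration; the summary above does not say which readouts or estimators the authors use.

```python
import math
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width binned ECE: sum over bins of weight * |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((c, y))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, y in members if y) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Three hypothetical ways to turn answer-token log-probabilities into one score.
def readout_joint(token_logprobs: Sequence[float]) -> float:
    """Probability of the whole answer string (product of token probabilities)."""
    return math.exp(sum(token_logprobs))


def readout_length_normalized(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of the token probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def readout_first_token(token_logprobs: Sequence[float]) -> float:
    """Probability of the first answer token only."""
    return math.exp(token_logprobs[0])


def ece_gap(
    token_conf: Sequence[float],
    verbal_conf: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Assumed convention: positive means token-probability scores are worse calibrated."""
    return expected_calibration_error(
        token_conf, correct, n_bins
    ) - expected_calibration_error(verbal_conf, correct, n_bins)
```

Because the joint readout shrinks with answer length while the length-normalized readout does not, the same log-probabilities can produce noticeably different confidence values, and hence a different gap, depending on which function is applied. The verbalized side is untouched in each case.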
Developers evaluating LLM confidence calibration should therefore specify and report the measurement protocol explicitly: the scored answer string, the token readout, and the conditioning context.
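One way to make a protocol explicit is to store it next to every reported number. The sketch below is an illustration only: the class name `CalibrationProtocol`, its field names, and all example values other than "generated" and "bare" (which echo the default protocol named above) are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CalibrationProtocol:
    """Hypothetical record of the measurement choices behind one calibration number."""

    scored_answer: str          # which answer string is scored, e.g. "generated"
    token_readout: str          # how the score is read from answer tokens
    conditioning_context: str   # e.g. "bare"
    ece_estimator: str          # binning scheme used for ECE
    verbalized_prompt_id: str   # identifier of the fixed elicitation prompt

    def tag(self) -> str:
        """Stable JSON string suitable for a results table or a log line."""
        return json.dumps(asdict(self), sort_keys=True)


default_protocol = CalibrationProtocol(
    scored_answer="generated",
    token_readout="joint",              # assumed label, not from the paper
    conditioning_context="bare",
    ece_estimator="equal_width_10",     # assumed label, not from the paper
    verbalized_prompt_id="template_v1", # assumed label, not from the paper
)
print(default_protocol.tag())
```

Two results are then comparable only when their tags match, which keeps a change of readout or context from being mistaken for a change in the model.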