arXiv cs.AI · Thursday · May 28, 2026

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

llm reliability bayesian inference reasoning calibration ai

A new arXiv paper introduces Prefix-Safe Bayesian Belief Tracking (SBBT), a framework for estimating the reliability of long reasoning traces generated by Large Language Models (LLMs). The task is to predict eventual success, $P(y=1 \mid o_{1:t})$, using only prefix-safe observations $o_{1:t}$, that is, evidence drawn from the part of the trace available up to step $t$. SBBT sequentially calibrates observation likelihoods and recursively updates a two-state belief. The same update accepts several input types: scalar scores, text, self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Evaluated on open-weight traces from MATH-500, GSM8K, AIME 2025, and RIMO-N, SBBT affected probability quality and ranking differently. Scalar-only SBBT frequently improved Brier scores, indicating better-calibrated probabilities, but gains in AUROC over strong prefix-safe baselines required structure-aware evidence. In the hardest math settings, structure-aware observations improved AUROC by 0.110 over standard prefix-safe baselines. The authors position SBBT as a calibration-aware online inference framework: scalar scores mainly improve probability quality, whereas structure-aware prefix information is what improves ranking.
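The recursive two-state update described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a fixed latent outcome (eventual success or failure), conditionally independent scalar observations, and Gaussian class-conditional likelihoods whose parameters are made up for illustration. The names `gaussian_pdf`, `update_belief`, and `track` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used as a stand-in observation likelihood."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def update_belief(belief, lik_success, lik_failure):
    """One Bayes step: turn P(y=1 | o_{1:t-1}) into P(y=1 | o_{1:t})."""
    numerator = belief * lik_success
    denominator = numerator + (1.0 - belief) * lik_failure
    if denominator == 0.0:
        return belief  # both likelihoods vanished; keep the previous belief
    return numerator / denominator


def track(scores, prior=0.5, success=(0.7, 0.2), failure=(0.4, 0.2)):
    """Return the belief after each step, using only the scores seen so far.

    `success` and `failure` are assumed (mean, std) pairs for the score
    distribution under each outcome; in practice they would be fitted on
    held-out traces.
    """
    belief = prior
    trajectory = []
    for score in scores:
        belief = update_belief(
            belief,
            gaussian_pdf(score, *success),
            gaussian_pdf(score, *failure),
        )
        trajectory.append(belief)
    return trajectory


if __name__ == "__main__":
    for step, b in enumerate(track([0.8, 0.75, 0.3, 0.9]), start=1):
        print(f"step {step}: P(success) = {b:.3f}")
```

Because each belief depends only on the observations up to that step, running `track` on a truncated score list gives the same values as the corresponding prefix of the full run. Richer evidence types would enter only through the two likelihood terms passed to `update_belief`.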

// why it matters

Developers could use SBBT to monitor, while a reasoning trace is still being generated, how likely it is to end in a correct answer, and to build more reliable LLM applications on that signal.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
