arXiv cs.AI · Monday · June 1, 2026

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

llm judges evaluation rubrics benchmarking ai research

The arXiv paper "PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges" introduces a framework for making LLM judges more consistent and robust. A judge's evaluation depends heavily on the rubric it is given, and a vague rubric can produce inaccurate assessments. PReMISE therefore treats a reusable rubric as a measurement specification: with the LLM judge held fixed, changing the rubric changes the quality measurement the judge induces.

The framework does two things. It discovers a policy-level rubric set from pairwise human-preference data, and it audits any given rubric set under LLM-judge use. The audit covers four axes: structural adequacy, reliability, preference fit, and adversarial robustness.

The authors report that no raw rubric source scores high on reliability, preference-predictiveness, and adversarial robustness at once; in particular, high inter-rater agreement does not guarantee low exploitability. PReMISE is reported as the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously. The framework also contributes audit-targeted repair operations. One of them, preference-rank selection, raised judge accuracy on paired responses from 65.0% to 68.6%, a gain of 3.6 percentage points.
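To make "judge accuracy on paired responses" concrete, here is a minimal Python sketch of how such a number could be computed, along with one plausible reading of preference-rank selection. It is an illustration under stated assumptions, not the paper's algorithm: the data layout, the rubric item names, the summed-score decision rule, the tie handling, and the top-k selection in `select_by_preference_rank` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: per-rubric-item scores for responses A and B,
# plus which side the human annotator preferred ("a" or "b").
Pair = Tuple[Dict[str, float], Dict[str, float], str]


def pairwise_accuracy(pairs: List[Pair], items: List[str]) -> float:
    """Fraction of pairs where the summed rubric score agrees with the human.

    Assumption: the judge prefers the response with the higher total over
    the chosen rubric items, and a tie counts as a miss.
    """
    if not pairs:
        return 0.0
    correct = 0
    for scores_a, scores_b, preferred in pairs:
        total_a = sum(scores_a[i] for i in items)
        total_b = sum(scores_b[i] for i in items)
        if total_a == total_b:
            continue  # tie: no prediction, counted as incorrect
        predicted = "a" if total_a > total_b else "b"
        if predicted == preferred:
            correct += 1
    return correct / len(pairs)


def select_by_preference_rank(pairs: List[Pair], items: List[str], k: int) -> List[str]:
    """Keep the k rubric items that best predict human preference on their own.

    Assumption: this is one simple interpretation of "preference-rank
    selection"; the paper's actual procedure may differ.
    """
    ranked = sorted(items, key=lambda i: pairwise_accuracy(pairs, [i]), reverse=True)
    return ranked[:k]


# Toy data with made-up rubric items; not taken from the paper.
ITEMS = ["accuracy", "clarity", "length"]
PAIRS: List[Pair] = [
    ({"accuracy": 1, "clarity": 1, "length": 0}, {"accuracy": 0, "clarity": 0, "length": 1}, "a"),
    ({"accuracy": 0, "clarity": 1, "length": 1}, {"accuracy": 1, "clarity": 0, "length": 0}, "b"),
    ({"accuracy": 1, "clarity": 0, "length": 0}, {"accuracy": 0, "clarity": 1, "length": 1}, "a"),
    ({"accuracy": 0, "clarity": 0, "length": 1}, {"accuracy": 1, "clarity": 1, "length": 0}, "b"),
]

if __name__ == "__main__":
    kept = select_by_preference_rank(PAIRS, ITEMS, k=1)
    print("full rubric accuracy:", pairwise_accuracy(PAIRS, ITEMS))
    print("kept items:", kept)
    print("selected rubric accuracy:", pairwise_accuracy(PAIRS, kept))
```

In this toy data the "length" item always points away from the human preference, so dropping it improves agreement. That is the kind of effect a selection step of this sort aims for, though the size of any gain depends entirely on the data.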

// why it matters

Teams that rely on LLM judges can use PReMISE to audit a rubric for reliability, preference fit, and exploitability before trusting the scores it produces, rather than assuming that high agreement between raters implies a sound measurement.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
