PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges
The arXiv paper "PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges" introduces a framework for making LLM judges more consistent and robust. A judge's evaluations depend heavily on the rubric it is given, and a vague rubric can produce inaccurate assessments. PReMISE therefore treats a reusable rubric as a measurement specification: with the judge held fixed, changing the rubric changes the quality measurement the judge induces. The framework does two things. It discovers a policy-level rubric set from pairwise human-preference data, and it audits any given rubric set under LLM-judge use. The audit evaluates rubrics along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. The paper reports that no raw rubric source scores highly on reliability, preference-predictiveness, and adversarial robustness at the same time; in particular, high inter-rater agreement does not guarantee low exploitability. It also reports that PReMISE is the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously. Finally, the framework contributes audit-targeted repair operations such as preference-rank selection, which raised judge accuracy on paired responses from 65.0% to 68.6%.
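To make the preference-fit axis and the preference-rank selection repair more concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary above does not describe the procedure in detail, so the sketch assumes that each rubric yields a numeric score per response, that a rubric's preference fit is its agreement with the human label on non-tied pairs, and that selection keeps the top k rubrics by that agreement. The function names, the toy rubric names ("accuracy", "tone", "length"), and the toy scores are all hypothetical.

```python
from typing import Dict, List, Tuple

# One pair: (rubric scores for response A, rubric scores for response B, human label).
# The label is "A" or "B". This data layout is an assumption made for illustration.
Pair = Tuple[Dict[str, float], Dict[str, float], str]


def rubric_accuracy(pairs: List[Pair], rubric: str) -> float:
    """Fraction of non-tied pairs where this rubric alone agrees with the human label."""
    correct = 0
    decided = 0
    for scores_a, scores_b, label in pairs:
        diff = scores_a[rubric] - scores_b[rubric]
        if diff == 0:
            continue  # the rubric expresses no preference on this pair
        decided += 1
        predicted = "A" if diff > 0 else "B"
        if predicted == label:
            correct += 1
    return correct / decided if decided else 0.0


def select_by_preference_rank(pairs: List[Pair], rubrics: List[str], k: int) -> List[str]:
    """Keep the k rubrics that individually agree most often with human preferences."""
    ranked = sorted(rubrics, key=lambda r: rubric_accuracy(pairs, r), reverse=True)
    return ranked[:k]


def judge_accuracy(pairs: List[Pair], rubrics: List[str]) -> float:
    """Accuracy of a judge that sums the chosen rubric scores; ties count as errors."""
    correct = 0
    for scores_a, scores_b, label in pairs:
        total = sum(scores_a[r] - scores_b[r] for r in rubrics)
        if (total > 0 and label == "A") or (total < 0 and label == "B"):
            correct += 1
    return correct / len(pairs) if pairs else 0.0


# Hypothetical toy data: three rubrics scored on four response pairs.
ALL_RUBRICS = ["accuracy", "tone", "length"]
PAIRS: List[Pair] = [
    ({"accuracy": 5, "tone": 2, "length": 4}, {"accuracy": 3, "tone": 4, "length": 1}, "A"),
    ({"accuracy": 2, "tone": 5, "length": 5}, {"accuracy": 4, "tone": 1, "length": 2}, "B"),
    ({"accuracy": 4, "tone": 3, "length": 1}, {"accuracy": 1, "tone": 3, "length": 5}, "A"),
    ({"accuracy": 1, "tone": 2, "length": 3}, {"accuracy": 3, "tone": 4, "length": 3}, "B"),
]

if __name__ == "__main__":
    kept = select_by_preference_rank(PAIRS, ALL_RUBRICS, k=1)
    print("kept rubrics:", kept)
    print("accuracy with all rubrics:", judge_accuracy(PAIRS, ALL_RUBRICS))
    print("accuracy with kept rubrics:", judge_accuracy(PAIRS, kept))
```

On this toy data the full rubric set agrees with the human label on half of the pairs, while the single top-ranked rubric agrees on all of them. The sketch selects and evaluates on the same pairs for brevity; a real audit would measure accuracy on held-out pairs to avoid overstating the gain.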
Developers can use PReMISE to audit and repair the rubrics behind their LLM judges, making the evaluations that guide model development and deployment more reliable and robust.