PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges
Researchers introduced PReMISE, a framework that treats rubrics as measurement specifications in order to make LLM judges more reliable and robust. PReMISE discovers policy-level rubrics and audits existing ones along four axes, including structural adequacy and adversarial robustness. The audit found that no raw rubric source is simultaneously reliable, preference-predictive, and robust, which motivates a structured approach to evaluation. PReMISE's repair operations can raise judge accuracy on paired responses from 65.0% to 68.6%, a gain of 3.6 percentage points.
For developers, PReMISE offers a way to audit and repair the rubrics behind an LLM judge before relying on its scores, which in turn supports model development and deployment decisions.
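To make the "judge accuracy on paired responses" figure concrete, here is a minimal sketch of how such a metric is commonly computed: the judge scores both responses in a pair, and it counts as correct when it ranks the human-preferred response higher. This is an illustration under stated assumptions, not PReMISE's actual code. The `Pair` schema, the `Judge` signature, the tie-handling rule, and the toy `length_judge` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Pair:
    """One paired comparison (hypothetical schema, not taken from PReMISE)."""
    prompt: str
    response_a: str
    response_b: str
    human_prefers_a: bool


# Hypothetical judge interface: (prompt, response) -> numeric score.
Judge = Callable[[str, str], float]


def pairwise_accuracy(judge: Judge, pairs: Sequence[Pair]) -> float:
    """Fraction of pairs where the judge scores the human-preferred response higher.

    Assumption: a tie counts as an incorrect judgment.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = 0
    for p in pairs:
        score_a = judge(p.prompt, p.response_a)
        score_b = judge(p.prompt, p.response_b)
        if score_a == score_b:
            continue  # tie: counted as incorrect
        if (score_a > score_b) == p.human_prefers_a:
            correct += 1
    return correct / len(pairs)


def length_judge(prompt: str, response: str) -> float:
    """Toy stand-in for an LLM judge: longer responses score higher."""
    return float(len(response))


# Tiny made-up dataset, for illustration only.
pairs = [
    Pair("q1", "short", "a much longer answer", human_prefers_a=False),
    Pair("q2", "detailed reply here", "ok", human_prefers_a=True),
    Pair("q3", "long but wrong answer", "right", human_prefers_a=False),
    Pair("q4", "same", "same", human_prefers_a=True),
]

accuracy = pairwise_accuracy(length_judge, pairs)
print(f"pairwise accuracy: {accuracy:.1%}")
```

Under this definition, comparing the same judge with an original rubric and with a repaired rubric on the same set of pairs would yield the kind of before-and-after accuracy figures quoted above.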


