A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test
A paper on arXiv proposes a minimum measurement standard for LLM-as-a-judge comparisons in retrieval-augmented generation (RAG). The standard fixes the top-100 candidate pool, evidence budget, answer cap, generator, and prompt. It also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and second-judge replication. The authors stress-test it with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC), an evolutionary evidence selector, on 400 multi-hop questions in computer science/machine learning and materials science. The protocol changes the empirical story: a binomial test makes all four semantic-baseline comparisons look significant, whereas cluster-aware methods may not support that conclusion. The standard targets the risk that a benchmark with clustered questions, analyzed as if every question were independent, overstates progress.
Developers evaluating RAG systems on clustered benchmarks should use cluster-aware inference instead of question-level tests that assume independence.
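The gap between a question-level binomial test and an exact cluster sign-flip check can be illustrated with a minimal sketch. This is not the paper's code. The function names, the encoding of judge preferences as +1 (candidate preferred) and -1 (baseline preferred), the cluster labels, and the toy data are all assumptions made for illustration.

```python
from itertools import product
from math import comb


def binomial_sign_test(diffs):
    """Two-sided exact sign test that treats every question as independent."""
    wins = sum(1 for d in diffs if d > 0)
    losses = sum(1 for d in diffs if d < 0)
    n = wins + losses  # ties (zero differences) are dropped
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of k or more successes out of n under p = 0.5, doubled.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def exact_cluster_sign_flip(diffs, clusters, max_clusters=20):
    """Two-sided exact sign-flip test that flips whole clusters at once.

    Enumerates all 2^C sign assignments over the C cluster sums, so it is
    only feasible for a small number of clusters.
    """
    sums = {}
    for d, c in zip(diffs, clusters):
        sums[c] = sums.get(c, 0.0) + d
    cluster_sums = list(sums.values())
    if len(cluster_sums) > max_clusters:
        raise ValueError("too many clusters for exact enumeration")

    observed = abs(sum(cluster_sums))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(cluster_sums)):
        stat = abs(sum(s * v for s, v in zip(signs, cluster_sums)))
        if stat >= observed - 1e-12:  # small tolerance for float sums
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical toy data: 40 questions in 4 clusters of 10.
# Clusters A-C: 8 wins and 2 losses each; cluster D: 6 wins and 4 losses.
diffs, clusters = [], []
for name, wins in (("A", 8), ("B", 8), ("C", 8), ("D", 6)):
    diffs += [1] * wins + [-1] * (10 - wins)
    clusters += [name] * 10

p_binom = binomial_sign_test(diffs)
p_cluster = exact_cluster_sign_flip(diffs, clusters)
print(f"question-level binomial p = {p_binom:.4f}")
print(f"exact cluster sign-flip p = {p_cluster:.4f}")
```

In this toy setup the binomial test sees 30 wins in 40 questions and returns a p-value below 0.01. The cluster-level test has only 2^4 = 16 sign assignments, so its smallest attainable two-sided p-value is 2/16 = 0.125, which is what it returns here. The same win rate therefore looks significant or not depending on whether the unit of inference is the question or the cluster.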