arXiv cs.AI · Thursday, May 28, 2026

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

rag · llm-as-a-judge · evaluation · multi-hop

A paper on arXiv proposes a minimum measurement standard for LLM-as-a-judge comparisons in retrieval-augmented generation (RAG). The standard fixes the top-100 candidate pool, evidence budget, answer cap, generator, and prompt. It also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and second-judge replication. The authors stress-test it with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC), an evolutionary evidence selector, on 400 multi-hop questions in computer science/machine learning and materials science. The protocol changes the empirical story: a binomial test, which assumes independent observations, makes all four semantic-baseline comparisons look significant, whereas cluster-aware methods may not support the same conclusion. The standard targets a specific failure mode: benchmarks whose questions are clustered can overstate progress when that clustering is ignored.
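To make the contrast between a binomial test and an exact cluster sign-flip check concrete, here is a minimal sketch in Python. It is not the paper's implementation: the cluster names, the win/loss counts, and both function names are hypothetical, ties are assumed to have been dropped, and the sign-flip statistic used here (the sum of per-cluster win-minus-loss totals) is one common choice rather than a confirmed detail of the proposed standard.

```python
from itertools import product
from math import comb

# Hypothetical judge outcomes per question, grouped by cluster:
# +1 = system A preferred, -1 = system B preferred (ties dropped).
# These numbers are made up for illustration only.
outcomes_by_cluster = {
    "c1": [+1] * 10,
    "c2": [+1] * 9 + [-1] * 1,
    "c3": [+1] * 2 + [-1] * 3,
    "c4": [+1] * 3 + [-1] * 2,
    "c5": [+1] * 2 + [-1] * 3,
    "c6": [+1] * 3 + [-1] * 3,
}


def binomial_sign_test(outcomes):
    """Two-sided exact sign test that treats every question as independent."""
    wins = sum(1 for o in outcomes if o > 0)
    losses = sum(1 for o in outcomes if o < 0)
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cluster_sign_flip_test(cluster_sums):
    """Two-sided exact sign-flip test over cluster-level sums.

    Enumerates all 2^K sign assignments, so it is only feasible
    when the number of clusters K is small.
    """
    observed = abs(sum(cluster_sums))
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(cluster_sums)):
        total += 1
        stat = abs(sum(s * c for s, c in zip(signs, cluster_sums)))
        if stat >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


all_outcomes = [o for qs in outcomes_by_cluster.values() for o in qs]
cluster_sums = [sum(qs) for qs in outcomes_by_cluster.values()]

p_binomial = binomial_sign_test(all_outcomes)
p_cluster = cluster_sign_flip_test(cluster_sums)

print(f"per-question binomial p-value: {p_binomial:.4f}")
print(f"cluster sign-flip p-value:     {p_cluster:.4f}")
```

On this made-up data, two clusters account for almost all of the wins. The per-question test falls below 0.05, while the cluster-level test does not, which is the kind of reversal the summary describes.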

// why it matters

Developers evaluating RAG systems with LLM judges should use cluster-aware inference, because treating clustered questions as independent can overstate progress.

Sources

Primary · arXiv cs.AI

