More eval traces will not stabilize your kappa. Stratify the ones you have
A developer found that their LLM-as-judge agreement, measured as Cohen's kappa against human labels, swung between 0.41 and 0.63 from week to week with no rubric change. The judge scored production traces against a 5-point rubric; each week a calibration set was hand-labeled and kappa computed. Suspecting sample size, they increased the weekly set from 50 traces to 200, but the variance barely moved. Stratifying the original 50 traces by score class and known failure dimensions reduced the swing more than quadrupling the sample did. The reason is composition: random sampling pulled mostly from the majority class (clean passes, easy 5s), while kappa is driven by agreement on the rare, ambiguous classes (2s and 3s). The 200 random traces therefore added mostly more easy passes: more data, but almost no new signal where it counts. Composition is the lever, not volume.
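As a rough illustration of the idea, here is a minimal Python sketch of a stratified calibration draw alongside an unweighted Cohen's kappa. Everything named here is an assumption for illustration, not the developer's actual pipeline: the trace dictionaries, the `judge_score` field, the `stratified_sample` helper, the equal per-class allocation, and the synthetic score distribution are all hypothetical. It also assumes the strata are defined by the judge's own score, since that is available before any hand-labeling happens.

```python
import random
from collections import Counter, defaultdict


def cohen_kappa(human, judge):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty, equal-length label lists")
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement: product of each rater's marginal frequency, summed over classes.
    human_freq = Counter(human)
    judge_freq = Counter(judge)
    expected = sum(human_freq[c] * judge_freq[c] for c in human_freq) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when both raters use one identical class
    return (observed - expected) / (1 - expected)


def stratified_sample(traces, key, total, rng):
    """Draw up to `total` traces, splitting the budget evenly across strata.

    Hypothetical helper: equal allocation is an assumption; a stratum smaller
    than its share simply contributes everything it has.
    """
    if not traces:
        return []
    strata = defaultdict(list)
    for trace in traces:
        strata[key(trace)].append(trace)
    per_stratum = total // len(strata)
    sample = []
    for label in sorted(strata):
        pool = strata[label]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample


rng = random.Random(0)

# Synthetic pool (made-up numbers): judge scores skewed heavily toward 5.
scores = [5] * 800 + [4] * 120 + [3] * 40 + [2] * 30 + [1] * 10
pool = [{"id": i, "judge_score": s} for i, s in enumerate(scores)]

# Same budget of 50 traces, but 10 drawn from each of the five score classes.
calibration = stratified_sample(pool, key=lambda t: t["judge_score"], total=50, rng=rng)
print(Counter(t["judge_score"] for t in calibration))
```

To stratify on failure dimensions as well, the `key` function could return a tuple such as `(score, failure_tag)`, assuming each trace carries such a tag. Once the sampled traces are hand-labeled, `cohen_kappa(human_labels, judge_labels)` gives the weekly figure. For an ordinal 5-point rubric a weighted kappa may be a better fit; the unweighted form is used here only to keep the sketch short.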
Stratifying evaluation traces by class improves judge stability without increasing sample size.