When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis
A new arXiv paper, "When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis," published on May 29, 2026, addresses the challenge of evaluating large language models (LLMs) that federal agencies use to categorize public comments. The authors argue that current evaluation methods, which often measure stance accuracy against small validated sets, cannot detect cases where different LLMs produce substantially different categorizations of the same public input. That gap matters because the way an LLM organizes the public record can shape what policymakers perceive and which arguments gain traction. In response, the paper introduces an Interpretive Audit Pipeline, which treats multi-model disagreement as a diagnostic for interpretive complexity and directs human reviewers to genuinely ambiguous public input. The researchers analyzed 1,260 public comments from a federal USDA docket using four distinct LLMs. They found that thematic divergence across models exceeded the variation a single model showed under prompt changes. They also found that an expert rubric, while appearing to settle disagreements, often suppressed deeper interpretive differences rather than resolving them. A subsequent two-stage labeling study on a 40-comment subsample, involving four LLMs and a human annotator, revealed varied revision behaviors. Notably, when revising, the human annotator frequently introduced new framings that were absent from the collective output of the LLM ensemble, underscoring how nuanced human interpretation of complex public discourse can be.
Developers building LLM-powered analysis tools should treat multi-model disagreement as a signal of interpretive complexity and use it to guide more robust evaluation and human-in-the-loop design.
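
To make the idea concrete, here is a minimal sketch of routing comments to human review based on how much a set of models disagree. It is not the paper's Interpretive Audit Pipeline, whose internals are not described here. The pairwise-disagreement score, the 0.5 threshold, the function names, and the comment IDs and theme labels are all assumptions chosen for illustration.

```python
from itertools import combinations


def pairwise_disagreement(labels):
    """Fraction of model pairs that gave different labels to one comment.

    Illustrative scoring choice (an assumption, not taken from the paper).
    Returns 0.0 when all models agree; approaches 1.0 as they all differ.
    """
    pairs = list(combinations(labels, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def flag_for_review(model_labels, threshold=0.5):
    """Return (comment_id, score) pairs at or above the threshold.

    model_labels maps comment_id -> {model_name: label}.
    The threshold is a hypothetical tuning knob. Results are sorted
    with the most contested comments first.
    """
    flagged = []
    for comment_id, by_model in model_labels.items():
        score = pairwise_disagreement(list(by_model.values()))
        if score >= threshold:
            flagged.append((comment_id, score))
    return sorted(flagged, key=lambda item: -item[1])


# Hypothetical data: four models assigning a theme to each comment.
example = {
    "c1": {"m1": "cost", "m2": "cost", "m3": "cost", "m4": "cost"},
    "c2": {"m1": "cost", "m2": "health", "m3": "cost", "m4": "environment"},
    "c3": {"m1": "health", "m2": "health", "m3": "access", "m4": "access"},
}

review_queue = flag_for_review(example)
for comment_id, score in review_queue:
    print(f"{comment_id}: disagreement={score:.2f}")
```

In this toy data, the unanimous comment is left out of the queue while the two contested ones are surfaced for a human reviewer. A real system would likely need a score that accounts for free-text or multi-label themes, and, as the paper's rubric finding suggests, agreement alone should not be read as proof that an interpretation is settled.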