RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation
RealUserSim, introduced in an arXiv paper (2605.20204), addresses the gap between simulated and real users in agent benchmarking. The framework extracts 7,275 executable behavioral profiles from 14,000+ authentic human-LLM conversations in WildChat and uses them to ground LLM-based user simulators in real data. A fidelity benchmark (PT3), covering 600 conversations across 71+ domains with anti-leakage controls, shows that grounded simulation raises the match rate across five behavioral dimensions from 24.2% to 45.3%. Agent evaluation on TauBench with 6 simulator models shows that grounded simulation acts as a realistic stress test: it surfaces three failure mechanisms that are invisible to cooperative simulators, and mean task success drops by 3.2% to 3.5%. The paper also identifies two failure modes of ungrounded simulators. Unconstrained LLM defaults hit a Formalism Ceiling, with style match rates of only 6-8%, while hand-crafted directives cause Directive Amplification, which pushes simulated behavior to unnatural extremes.
RealUserSim enables more realistic agent evaluation, uncovering hidden failures that cooperative simulators miss.
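To make the reported fidelity numbers concrete, here is a minimal sketch of how a match rate over behavioral dimensions could be computed, and how the 24.2% to 45.3% improvement translates into an absolute and a relative gain. The summary does not name the five dimensions or define the metric, so the dimension names, the label format, and the exact-agreement definition below are all assumptions for illustration, not the paper's method.

```python
from typing import Mapping, Sequence, Tuple

# Hypothetical placeholder names: the summary says there are five behavioral
# dimensions but does not list them.
DIMENSIONS = ("dim_a", "dim_b", "dim_c", "dim_d", "dim_e")


def match_rate(
    simulated: Sequence[Mapping[str, str]],
    real: Sequence[Mapping[str, str]],
    dimensions: Sequence[str] = DIMENSIONS,
) -> float:
    """Fraction of (conversation, dimension) pairs whose labels agree.

    Assumption: each conversation is reduced to one categorical label per
    dimension, and a match means the simulated label equals the real one.
    """
    if len(simulated) != len(real):
        raise ValueError("simulated and real must be paired one-to-one")
    total = len(real) * len(dimensions)
    if total == 0:
        return 0.0
    hits = sum(
        1
        for sim, ref in zip(simulated, real)
        for dim in dimensions
        # A dimension missing on either side counts as a mismatch.
        if dim in sim and dim in ref and sim[dim] == ref[dim]
    )
    return hits / total


def gain(baseline_pct: float, grounded_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative multiplier)."""
    return grounded_pct - baseline_pct, grounded_pct / baseline_pct


# Figures taken from the summary above.
absolute_points, multiplier = gain(24.2, 45.3)
print(f"{absolute_points:.1f} points, {multiplier:.2f}x")
```

With the reported figures, the improvement is about 21.1 percentage points, or roughly 1.87 times the ungrounded baseline. Even so, the grounded simulator still matches real behavior in fewer than half of the comparisons.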