arXiv cs.AI · Saturday · May 23, 2026

RealUserSim: Bridging the Reality Gap in Agent Benchmarking via Grounded User Simulation

agents · benchmarking · user-simulation · llm

RealUserSim, introduced in a paper on arXiv (2605.20204), addresses the gap between simulated and real users in agent benchmarking. The framework extracts 7,275 executable behavioral profiles from more than 14,000 authentic human-LLM conversations in WildChat and uses them to ground LLM user simulators in real data. A fidelity benchmark (PT3), covering 600 conversations across 71+ domains with anti-leakage controls, shows that grounded simulation raises the match rate from 24.2% to 45.3% across five behavioral dimensions. Agent evaluation on TauBench with six simulator models shows that grounded simulation acts as a realistic stress test: it surfaces three failure mechanisms that are invisible to cooperative simulators and lowers mean task success by 3.2% to 3.5%. The paper also describes two problems with ungrounded simulators: unconstrained LLM defaults hit a Formalism Ceiling (style match rates of only 6-8%), while hand-crafted directives cause Directive Amplification, which pushes behavior to unnatural extremes.
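To make the match-rate figures concrete, here is a minimal sketch of how such a score could be computed and compared. It is not the paper's PT3 implementation: the summary does not name the five behavioral dimensions or say how labels are assigned, so the dimension names (`style`, `patience`), the label values, and the helper functions `match_rate` and `gain` are all hypothetical.

```python
from typing import Dict, List, Tuple

# One conversation's behavioral labels, keyed by dimension name.
# Dimension names and label values are hypothetical placeholders.
Labels = Dict[str, str]


def match_rate(simulated: List[Labels], real: List[Labels]) -> float:
    """Fraction of (conversation, dimension) pairs where the simulator's
    label equals the label observed for the real user."""
    if len(simulated) != len(real):
        raise ValueError("simulated and real must cover the same conversations")
    total = 0
    matches = 0
    for sim, ref in zip(simulated, real):
        for dim, ref_label in ref.items():
            total += 1
            if sim.get(dim) == ref_label:
                matches += 1
    return matches / total if total else 0.0


def gain(baseline: float, grounded: float) -> Tuple[float, float]:
    """Return (absolute difference, ratio) between two rates."""
    return grounded - baseline, grounded / baseline


# Toy data: two conversations, two hypothetical dimensions each.
real = [
    {"style": "terse", "patience": "low"},
    {"style": "formal", "patience": "high"},
]
sim = [
    {"style": "terse", "patience": "high"},
    {"style": "formal", "patience": "high"},
]
rate = match_rate(sim, real)  # 3 of the 4 pairs agree

# Figures reported in the summary above, in percent.
abs_gain, ratio = gain(24.2, 45.3)
```

Applied to the reported figures, the helper gives a gain of about 21.1 points, or roughly 1.87 times the ungrounded baseline.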

// why it matters

Enables more realistic agent evaluation, uncovering hidden failures that cooperative simulators miss.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org
