DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
DeepWeb-Bench, introduced in an arXiv paper (2605.21482), is a deep research benchmark designed to be substantially harder than existing ones. It requires agents to collect large volumes of evidence from multiple sources, reconcile conflicting information, and carry out long-horizon, multi-step derivations. The benchmark organizes difficulty into four capability families: Retrieval, Derivation, Reasoning, and Calibration. For auditability, each reference answer comes with a source-provenance record that has four disclosure levels and includes cross-source checks. Evaluations of nine frontier models show that retrieval failures account for only a small share of errors, which points to reasoning and derivation as the primary bottlenecks: current models struggle more with multi-step reasoning and evidence integration than with finding the information in the first place.
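To make the idea of a source-provenance record with cross-source checks more concrete, here is a minimal Python sketch. It is not the benchmark's actual schema: the class names, field names, the integer encoding of the four disclosure levels, and the tolerance-based agreement rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceClaim:
    """One value reported by one source (hypothetical structure)."""
    source: str   # illustrative source identifier, e.g. a site name
    value: float  # the number this source reports


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance record attached to a reference answer."""
    answer: float
    # The summary mentions four disclosure levels but does not name them,
    # so they are modeled here as the integers 1..4 (an assumption).
    disclosure_level: int
    claims: List[SourceClaim] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.disclosure_level not in (1, 2, 3, 4):
            raise ValueError("disclosure_level must be 1, 2, 3, or 4")

    def cross_source_check(self, rel_tol: float = 0.01, min_sources: int = 2) -> bool:
        """Return True if at least `min_sources` distinct sources report a
        value within `rel_tol` (relative) of the reference answer."""
        agreeing = {
            c.source
            for c in self.claims
            if abs(c.value - self.answer) <= rel_tol * abs(self.answer)
        }
        return len(agreeing) >= min_sources


# Usage with made-up values: two sources agree with the answer, one does not.
record = ProvenanceRecord(
    answer=100.0,
    disclosure_level=2,
    claims=[
        SourceClaim("source_a", 100.5),
        SourceClaim("source_b", 99.8),
        SourceClaim("source_c", 120.0),
    ],
)
passes_check = record.cross_source_check()
```

The agreement rule here is deliberately simple. A real audit trail would likely also need to record how conflicting sources were reconciled, which this sketch does not attempt.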
For developers building deep research agents, the practical implication is to prioritize reasoning and evidence reconciliation rather than retrieval alone.
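One way a developer might act on this is to tag each failed task in their own evaluation runs with one of the four capability families and look at the resulting shares. The sketch below assumes such per-failure labels exist. The function name and the toy labels are hypothetical, and the numbers are invented for illustration, not results from the paper.

```python
from collections import Counter
from typing import Dict, Iterable

# The four capability families named in the benchmark summary.
FAMILIES = ("Retrieval", "Derivation", "Reasoning", "Calibration")


def error_shares(failure_labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of failures attributed to each capability family."""
    counts = Counter(failure_labels)
    unknown = set(counts) - set(FAMILIES)
    if unknown:
        raise ValueError(f"unknown capability families: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        # No failures recorded: report a zero share for every family.
        return {family: 0.0 for family in FAMILIES}
    return {family: counts[family] / total for family in FAMILIES}


# Toy labels, invented purely to exercise the function.
toy_failures = [
    "Reasoning", "Derivation", "Reasoning", "Retrieval",
    "Calibration", "Derivation", "Reasoning", "Derivation",
]
shares = error_shares(toy_failures)
```

If the retrieval share in such a breakdown is small, as the paper reports for the models it evaluated, effort spent on better search tooling alone is unlikely to move overall accuracy much.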