arXiv cs.AISaturday · May 23, 2026FREE

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

benchmarkdeep-researchreasoningagents

DeepWeb-Bench, introduced in a paper on arXiv (2605.21482), is a deep research benchmark designed to be significantly harder than current ones. It requires agents to collect massive evidence from multiple sources, reconcile conflicting information, and perform long-horizon multi-step derivations. The benchmark categorizes difficulty into four capability families: Retrieval, Derivation, Reasoning, and Calibration. Each reference answer includes a source-provenance record with four disclosure levels and cross-source checks for auditability. Evaluations on nine frontier models reveal that retrieval failures account for only a small portion of errors, indicating that reasoning and derivation are the primary bottlenecks. This suggests that current models struggle with complex multi-step reasoning and evidence integration, not just information retrieval.

// why it matters

Developers building deep research agents must focus on improving reasoning and evidence reconciliation, not just retrieval.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org

Like this? Get the next digest.