arXiv cs.AI · Monday · June 1, 2026

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

benchmark · agents · data-analysis · long-horizon

Researchers introduced LongDS-Bench, a benchmark for evaluating long-horizon, multi-turn data analysis agents. The benchmark comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains, including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns such as counterfactual perturbation, rollback, and multi-state composition, with an average dependency span of 11.3 turns. Of the five state-of-the-art models evaluated, the best reached only 48.45% average accuracy. Performance dropped nearly 47 points from early to late turns, and long-horizon errors accounted for 52%–69% of failures. Further analysis showed that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than the size of the interaction budget.

// why it matters

Developers building data analysis agents should prioritize keeping the analytical state correct across turns over giving the agent more interaction steps.
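
To make "state management" concrete, here is a minimal sketch of one way an agent harness could track analytical state so that rollback and counterfactual requests resolve against the right snapshot. It is not taken from the paper: the `AnalysisState` class, its method names, and the toy `sales` data are all hypothetical, and a real agent would more likely hold DataFrames than plain dicts.

```python
from copy import deepcopy


class AnalysisState:
    """Hypothetical versioned store for an agent's working data (not from the paper)."""

    def __init__(self, data):
        # Snapshot 0 is the data as loaded; each applied turn appends one more.
        self._snapshots = [deepcopy(data)]

    @property
    def current(self):
        return self._snapshots[-1]

    def apply(self, transform):
        """Run a transform on a copy of the current state and record the result."""
        new_state = transform(deepcopy(self.current))
        self._snapshots.append(new_state)
        return len(self._snapshots) - 1  # turn index of the new snapshot

    def rollback(self, turn):
        """Discard every snapshot recorded after the given turn."""
        if not 0 <= turn < len(self._snapshots):
            raise IndexError(f"no snapshot for turn {turn}")
        del self._snapshots[turn + 1:]

    def counterfactual(self, turn, transform):
        """Evaluate a what-if on an earlier snapshot without changing history."""
        return transform(deepcopy(self._snapshots[turn]))


# Toy data, invented for illustration only.
state = AnalysisState({"sales": [10, 20, 30]})
t1 = state.apply(lambda d: {**d, "sales": [s * 2 for s in d["sales"]]})
t2 = state.apply(lambda d: {**d, "total": sum(d["sales"])})

# "What if we had only the first two rows at turn 1?" leaves history intact.
what_if = state.counterfactual(t1, lambda d: sum(d["sales"][:2]))

# "Undo the last step" returns the working state to turn 1.
state.rollback(t1)
```

Copying on every turn trades memory for simplicity; the point of the sketch is only that each later turn reads from an explicit, addressable snapshot rather than from whatever the agent last happened to leave in memory.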

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
