arXiv cs.AI · Saturday · May 23, 2026 · FREE

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

benchmark · ai-education · evaluation

Researchers introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review each other's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%. This practice gives students direct exposure to powerful tools while requiring them to specify what a trustworthy answer would require, addressing the need for AI education that goes beyond using AI as a productivity tool.
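To make the headline metric concrete, here is a minimal sketch of how a mean question-level pass rate could be computed. It assumes one reading of the term: for each question, take the fraction of evaluated systems that pass it, then average those fractions over all questions. The summary does not spell out the paper's exact definition, and the function names, system names, and toy results below are hypothetical illustrations, not QuestBench data.

```python
from statistics import mean


def question_pass_rates(results):
    """Return, per question, the fraction of systems that passed it.

    `results` maps a question id to a dict of system name -> bool (passed).
    This layout is an assumption made for illustration.
    """
    return {
        qid: mean(1.0 if passed else 0.0 for passed in by_system.values())
        for qid, by_system in results.items()
    }


def mean_question_pass_rate(results):
    """Average the per-question pass rates over all questions."""
    return mean(list(question_pass_rates(results).values()))


# Hypothetical toy data: 3 questions, 4 systems. Not real benchmark results.
toy_results = {
    "q1": {"system_a": True, "system_b": False, "system_c": False, "system_d": False},
    "q2": {"system_a": False, "system_b": False, "system_c": False, "system_d": False},
    "q3": {"system_a": True, "system_b": True, "system_c": False, "system_d": False},
}

if __name__ == "__main__":
    print(question_pass_rates(toy_results))
    print(mean_question_pass_rate(toy_results))
```

Under this reading, a question that no system passes contributes a rate of zero, so a low mean indicates that most questions defeat most systems rather than that a few systems fail everywhere.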

// why it matters

Educators and developers gain a methodology for building rigorous benchmarks that expose weaknesses in AI systems such as deep research tools.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
