arXiv cs.AI · Saturday, May 23, 2026

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

llm · planning · benchmark · data-generation

PlanningBench, introduced in a new arXiv paper (2605.20873v1), targets a limitation of existing planning benchmarks: they treat data as a fixed set of instances. The framework starts from real planning scenarios and abstracts them into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. A constraint-driven synthesis pipeline then instantiates self-contained planning problems, applying adaptive difficulty control, quality filtering, and instance-level verification checklists. Difficulty is therefore tied to structural sources rather than surface-level proxies. The result is scalable generation with controllable scenario coverage and automatically verifiable solutions, usable for both evaluating and training LLMs.
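To make the idea of constraint-driven synthesis with an instance-level checklist concrete, here is a minimal Python sketch. It is not the paper's pipeline: the toy scheduling task, the `Instance` class, the functions `generate_instance`, `build_checklist`, and `verify`, and the two difficulty knobs (task count and number of precedence constraints) are all assumptions made for illustration.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical toy problem: assign every task to one time slot."""
    tasks: tuple[str, ...]
    num_slots: int
    capacity: int  # maximum number of tasks per slot
    precedence: tuple[tuple[str, str], ...]  # (a, b): a must be in an earlier slot than b


def generate_instance(num_tasks: int, num_precedence: int, seed: int):
    """Build an instance around a hidden reference schedule.

    Constraints are sampled only from pairs the reference already satisfies,
    so every generated instance is solvable by construction.
    """
    rng = random.Random(seed)
    tasks = tuple(f"t{i}" for i in range(num_tasks))
    capacity = 2
    num_slots = (num_tasks + capacity - 1) // capacity
    order = list(tasks)
    rng.shuffle(order)
    reference = {task: idx // capacity for idx, task in enumerate(order)}
    candidates = [(a, b) for a in tasks for b in tasks if reference[a] < reference[b]]
    chosen = rng.sample(candidates, min(num_precedence, len(candidates)))
    return Instance(tasks, num_slots, capacity, tuple(chosen)), reference


def build_checklist(inst: Instance):
    """Return named checks; each maps a candidate schedule (task -> slot) to a bool."""
    checks = {
        "every task scheduled exactly once": lambda s: set(s) == set(inst.tasks),
        "slots within range": lambda s: all(
            isinstance(v, int) and 0 <= v < inst.num_slots for v in s.values()
        ),
        "slot capacity respected": lambda s: all(
            list(s.values()).count(slot) <= inst.capacity for slot in set(s.values())
        ),
    }
    for a, b in inst.precedence:
        # Bind a and b as defaults so each lambda keeps its own pair.
        checks[f"{a} before {b}"] = lambda s, a=a, b=b: a in s and b in s and s[a] < s[b]
    return checks


def verify(inst: Instance, schedule: dict) -> dict:
    """Run every checklist item and report pass/fail per item."""
    return {name: check(schedule) for name, check in build_checklist(inst).items()}
```

In this sketch, a model's answer would be parsed into a task-to-slot mapping and passed to `verify`; the per-item report shows which constraint failed rather than giving a single pass/fail bit. Raising `num_tasks` or `num_precedence` makes instances structurally harder, which is the kind of control the summary attributes to the framework.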

// why it matters

Enables scalable, verifiable planning data generation for training and evaluating LLMs.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org
