PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models
PlanningBench, introduced in a new arXiv paper (2605.20873v1), targets a limitation of existing planning benchmarks: they treat data as a fixed set of instances. The framework starts from real planning scenarios and abstracts them into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. A constraint-driven synthesis pipeline then instantiates self-contained planning problems, applying adaptive difficulty control and quality filtering and attaching a verification checklist to each instance. This enables scalable generation, automatic verification, and planning-oriented training, and it ties difficulty to structural sources rather than surface-level proxies. The approach supports both evaluation and training of LLMs, with controllable scenario coverage and verifiable solutions.
Enables scalable, verifiable planning data generation for training and evaluating LLMs.
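To make the idea of constraint-driven synthesis with instance-level verification concrete, here is a minimal Python sketch. It is not the paper's pipeline: the single-worker scheduling task, the names `Precedence`, `Problem`, `synthesize`, `serial_schedule`, and `checklist`, and the use of task count and constraint count as difficulty factors are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Precedence:
    """Hypothetical constraint: `before` must finish before `after` starts."""
    before: str
    after: str


@dataclass
class Problem:
    durations: dict[str, int]      # task name -> duration
    constraints: list[Precedence]  # precedence constraints between tasks
    deadline: int                  # latest allowed finish time


def synthesize(num_tasks: int, num_constraints: int, seed: int = 0) -> Problem:
    """Instantiate a toy single-worker scheduling problem.

    Assumption: num_tasks and num_constraints stand in for structural
    difficulty factors. Edges only point from a lower-indexed task to a
    higher-indexed one, so the constraint graph is acyclic and solvable.
    """
    rng = random.Random(seed)
    names = [f"t{i}" for i in range(num_tasks)]
    durations = {name: rng.randint(1, 4) for name in names}
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    chosen = rng.sample(pairs, min(num_constraints, len(pairs)))
    constraints = [Precedence(a, b) for a, b in chosen]
    # Running every task back to back always fits within this deadline.
    return Problem(durations, constraints, deadline=sum(durations.values()))


def serial_schedule(problem: Problem) -> dict[str, int]:
    """Reference solution: run tasks back to back in index order."""
    schedule, clock = {}, 0
    for name, duration in problem.durations.items():
        schedule[name] = clock
        clock += duration
    return schedule


def checklist(problem: Problem, schedule: dict[str, int]) -> dict[str, bool]:
    """Instance-level checklist: one named pass/fail item per requirement."""
    if set(schedule) != set(problem.durations):
        return {"every task is scheduled": False}
    end = {t: schedule[t] + problem.durations[t] for t in schedule}
    checks = {"every task is scheduled": True}
    checks["no start time is negative"] = all(s >= 0 for s in schedule.values())
    for c in problem.constraints:
        label = f"{c.before} finishes before {c.after} starts"
        checks[label] = end[c.before] <= schedule[c.after]
    order = sorted(schedule, key=schedule.get)
    checks["no two tasks overlap"] = all(
        end[a] <= schedule[b] for a, b in zip(order, order[1:])
    )
    checks["all tasks finish by the deadline"] = (
        max(end.values(), default=0) <= problem.deadline
    )
    return checks
```

In this sketch a candidate plan is accepted only if every checklist item passes, for example `all(checklist(problem, plan).values())`, and raising `num_tasks` or `num_constraints` makes instances structurally harder rather than merely longer.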