SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills
SkillEvolBench, introduced in a new arXiv paper, is a diagnostic benchmark designed to assess whether LLM agents can evolve from reusing episodic experiences to forming reusable procedural skills. The benchmark comprises 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents first learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks that test context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, the authors find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, but individual models still struggle with generalization. The benchmark provides a standardized evaluation for future research on skill formation in LLM agents.
Developers building LLM agents should note the practical implication: according to these results, current agents often adapt locally but rarely turn experience into robust, reusable skills.
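The acquisition-then-deployment protocol summarized above can be pictured with a minimal Python sketch. This is not the authors' harness: every name here (`Task`, `SkillLibrary`, `compact`, `run_condition`, the condition labels, and the toy agent and verifier) is a hypothetical stand-in, and the sketch assumes that "frozen" means the skill library receives no updates during deployment. It only illustrates how the four conditions differ in what context the agent sees.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

# Condition labels paraphrase the summary; the exact names are assumptions.
CONDITIONS = ("no_skill", "raw_trajectory", "self_generated", "curated_start")


@dataclass(frozen=True)
class Task:
    family: str  # task family whose members share a latent procedure
    prompt: str


@dataclass
class SkillLibrary:
    """External skill store (hypothetical structure)."""
    skills: List[str] = field(default_factory=list)

    def update(self, compacted: str, verified: bool) -> None:
        # Assumption: store one note per episode, tagged with the verifier outcome.
        tag = "works" if verified else "avoid"
        self.skills.append(f"[{tag}] {compacted}")


Agent = Callable[[Task, Sequence[str]], str]  # (task, context) -> trajectory
Verifier = Callable[[Task, str], bool]        # (task, trajectory) -> success


def compact(trajectory: str, limit: int = 80) -> str:
    """Placeholder compaction: simply truncate the trajectory."""
    return trajectory[:limit]


def success_rate(outcomes: Sequence[bool]) -> float:
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def run_condition(
    condition: str,
    acquisition: Sequence[Task],
    deployment: Sequence[Task],
    agent: Agent,
    verifier: Verifier,
    curated: Sequence[str] = (),
) -> Dict[str, float]:
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    uses_library = condition in ("self_generated", "curated_start")
    library = SkillLibrary(list(curated) if condition == "curated_start" else [])
    raw_traces: List[str] = []

    def context() -> List[str]:
        if uses_library:
            return list(library.skills)
        return list(raw_traces)  # stays empty for the no_skill control

    # Acquisition phase: the agent acts, the verifier scores, memory is updated.
    acquired = []
    for task in acquisition:
        trajectory = agent(task, context())
        ok = verifier(task, trajectory)
        acquired.append(ok)
        if uses_library:
            library.update(compact(trajectory), ok)
        elif condition == "raw_trajectory":
            raw_traces.append(trajectory)  # reuse episodic traces verbatim

    # Deployment phase: memory is read-only (assumed meaning of "frozen").
    frozen_context = context()
    deployed = [verifier(t, agent(t, frozen_context)) for t in deployment]
    return {
        "acquisition": success_rate(acquired),
        "deployment": success_rate(deployed),
    }


# Toy stand-ins so the sketch runs; they say nothing about real model behavior.
def toy_agent(task: Task, context: Sequence[str]) -> str:
    hit = any(task.family in entry for entry in context)
    return f"{task.family}: {'solved' if hit else 'failed'}"


def toy_verifier(task: Task, trajectory: str) -> bool:
    return trajectory.endswith("solved")
```

Calling `run_condition` once per entry in `CONDITIONS` with the same tasks, agent, and verifier yields per-condition success rates that can be compared side by side. That comparison is the one the benchmark relies on to separate skill formation from base capability, curated prior knowledge, and direct reuse of episodic traces.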