Agentic Proving for Program Verification
A new arXiv paper evaluates Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. The agent generated arguably valid specifications for 98.8% of problems (81.3% accepted by CLEVER's isomorphism-based scoring), certified implementations against correct ground-truth specifications for 87.5% of problems, and achieved a 98.1% success rate on the end-to-end pipeline for entries with self-consistent premises. Claude also provided high-quality feedback on its own attempts, identifying failure causes and dataset bugs. These findings suggest a growing mismatch between benchmark difficulty and agentic prover capabilities, highlighting the need for more rigorous benchmarks.
Program verification benchmarks may need redesigning as agentic provers exceed current difficulty levels.