arXiv cs.AIMonday · May 25, 2026FREE

Agentic Proving for Program Verification

claudeagentsprogram-verificationlean4

A new arXiv paper evaluates Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. The agent generated arguably valid specifications for 98.8% of problems (81.3% accepted by CLEVER's isomorphism-based scoring), certified implementations against correct ground-truth specifications for 87.5% of problems, and achieved a 98.1% success rate on the end-to-end pipeline for entries with self-consistent premises. Claude also provided high-quality feedback on its own attempts, identifying failure causes and dataset bugs. These findings suggest a growing mismatch between benchmark difficulty and agentic prover capabilities, highlighting the need for more rigorous benchmarks.

// why it matters

Program verification benchmarks may need redesigning as agentic provers exceed current difficulty levels.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org

Like this? Get the next digest.

Agentic Proving for Program Verification — aigest.dev