arXiv cs.AI · Monday · May 25, 2026

GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

llm · benchmark · strategic-reasoning · evaluation

GENSTRAT, introduced in a new arXiv paper (2605.23238), addresses the challenge of evaluating LLMs as economic agents in marketplaces, auctions, and bidding settings. Existing benchmarks built on fixed canonical games are prone to saturation and contamination. GENSTRAT instead generates two-player zero-sum imperfect-information card games from a distribution, drawing fresh games on demand for evergreen evaluation. It pairs this generator with a capability-profile methodology that decomposes model competence along six axes: state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness. A jaggedness measure quantifies how smoothly performance varies within the distribution, flagging cases where a model's advantage jumps unpredictably. Together, these are intended to let evaluators generalize from benchmark performance to real-world strategic environments. The paper was published on arXiv on May 25, 2026.

// why it matters

Aims to make evaluation of LLMs as economic agents in real-world strategic settings more reliable.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
