arXiv cs.AI · Monday · June 1, 2026

CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

benchmark · llm · code-generation · code-golf

CodeGolf Bench, introduced in a paper on arXiv, is a benchmark for assessing how well large language models (LLMs) produce concise code. It draws on the code.golf platform for dynamic problem sets and live human performance baselines, and it covers 60 programming languages, addressing two limitations of existing benchmarks: fixed problem sets and limited language coverage. In an evaluation of nine LLMs on Python and C++ tasks, reasoning models (e.g., those using chain-of-thought) significantly outperformed non-reasoning models, with a best average percentile of 70.97%. The gap is especially pronounced in C++, where reasoning models excel due to the language's strict syntax requirements. Non-reasoning models struggled with efficiency optimization and achieved significantly lower percentiles. Because both the problems and the human baselines keep changing, the benchmark offers a dynamic framework for tracking LLM code generation against evolving human performance, and a distinctive measure of efficiency and conciseness.
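To make the "average percentile" figure concrete, here is a minimal Python sketch of one plausible way to score a model against human baselines. It is an illustration under stated assumptions, not the paper's confirmed formula: the function names, the use of byte counts as the length measure, the tie-handling rule, and the choice to score a failed problem as 0 are all hypothetical.

```python
from __future__ import annotations


def percentile_rank(model_bytes: int, human_bytes: list[int]) -> float:
    """Percentage of human solutions the model ties or beats.

    Assumption: shorter is better, and a tie counts in the model's favor.
    """
    if not human_bytes:
        raise ValueError("need at least one human solution length")
    tied_or_beaten = sum(1 for h in human_bytes if model_bytes <= h)
    return 100.0 * tied_or_beaten / len(human_bytes)


def average_percentile(results: dict[str, tuple[int | None, list[int]]]) -> float:
    """Mean percentile over problems.

    `results` maps a problem name to (model length or None, human lengths).
    Assumption: a problem the model failed to solve (None) scores 0.
    """
    if not results:
        raise ValueError("need at least one problem")
    scores = [
        0.0 if model is None else percentile_rank(model, humans)
        for model, humans in results.values()
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical lengths, invented for illustration only.
    demo = {
        "problem-a": (50, [40, 50, 60, 70]),
        "problem-b": (None, [10, 20]),
    }
    print(average_percentile(demo))
```

Under this sketch, a model that matches a mid-pack human entry on one problem and fails another lands well below the midpoint, which shows how correctness failures and weak length optimization would both pull an average percentile down.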

// why it matters

Provides a dynamic, multi-language benchmark for evaluating LLM code conciseness against human performance.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org

