CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models
CodeGolf Bench, introduced in a paper on arXiv, is a benchmark for assessing how well large language models (LLMs) produce concise code. It draws on the code.golf platform for dynamic problem sets and live human performance baselines, and covers 60 programming languages, addressing two limitations of existing benchmarks: fixed problem sets and narrow language coverage. An evaluation of nine LLMs on Python and C++ tasks found that reasoning models (e.g., those using chain-of-thought) significantly outperform non-reasoning models, with the best model reaching an average percentile of 70.97% against human submissions. The gap is especially pronounced in C++, where reasoning models excel, reportedly because of the language's strict syntax requirements. Non-reasoning models struggled to optimize their solutions and scored significantly lower percentiles. Because both the problems and the human baselines keep changing, the benchmark provides a dynamic framework for tracking LLM code generation against evolving human performance, offering a distinctive measure of efficiency and conciseness.
Provides a dynamic, multi-language benchmark for evaluating LLM code conciseness against human performance.
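To make the percentile scores mentioned above concrete, here is a minimal Python sketch of how a solution's length could be ranked against human submissions. It is an illustration only: the summary does not give the paper's exact formula, so the function names (`code_length`, `golf_percentile`, `average_percentile`), the choice of UTF-8 byte length, and the "share of human solutions that are strictly longer" definition are all assumptions, and the numbers in the usage note are invented.

```python
from statistics import mean


def code_length(source: str) -> int:
    """Length of a solution in UTF-8 bytes (assumed scoring unit)."""
    return len(source.encode("utf-8"))


def golf_percentile(model_bytes: int, human_bytes: list[int]) -> float:
    """Percent of human solutions that are strictly longer than the model's.

    Assumed definition for illustration; ties do not count as beaten.
    """
    if not human_bytes:
        raise ValueError("need at least one human solution")
    beaten = sum(1 for h in human_bytes if h > model_bytes)
    return 100.0 * beaten / len(human_bytes)


def average_percentile(results: list[tuple[int, list[int]]]) -> float:
    """Mean percentile over (model_bytes, human_bytes) pairs, one per problem."""
    return mean(golf_percentile(m, humans) for m, humans in results)
```

For example, under these assumptions a 60-byte model solution ranked against hypothetical human lengths of 40, 45, 52, 60, 75, 90, 120 and 150 bytes beats four of the eight, giving a percentile of 50. Averaging such per-problem scores gives a single figure of the kind reported for each model.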