Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
A study published on arXiv investigated how robust Large Language Models (LLMs) are at mathematical reasoning when problems are slightly varied, for example by changing names or numbers. The researchers compared three approaches: pure reasoning via Chain-of-Thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution with Step-by-Step Coding (SBSC). Each method was tested on 1,000 paired original and modified problems drawn from the GSM-Symbolic dataset, with all evaluations run on Claude Haiku 4.5. CoT degraded the least under perturbation: its accuracy dropped by 1.3 percentage points and it broke on 1.8% of problems. PAL degraded the most, with a 1.7 percentage point drop in accuracy and failures on 3.1% of modified problems. SBSC fell between the two. The differences in robustness were not statistically significant (p = .096), so the findings only suggest, rather than establish, that direct natural language reasoning is slightly more stable against minor problem variations than current code-execution methods. The results are relevant to choosing between reasoning and code execution when LLMs are applied to mathematical tasks in practice.
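The two robustness figures quoted for each method, the accuracy drop and the share of problems that break, measure different things, and a small sketch may make the distinction concrete. The code below is an illustration and not the study's evaluation code: the `PairResult` record, the `robustness_summary` function, and the toy data are all hypothetical, and it assumes that a problem "breaks" when the original version is answered correctly but the modified version is not, with the break rate taken over all pairs. The summary above does not spell out that definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Outcome for one original/modified problem pair (hypothetical record type)."""
    original_correct: bool
    modified_correct: bool


def robustness_summary(results: list[PairResult]) -> dict[str, float]:
    """Summarise paired results as an accuracy drop and a break rate.

    Assumption: a pair is "broken" when the original is answered correctly
    and the modified version is not; the rate is over all pairs.
    """
    n = len(results)
    if n == 0:
        raise ValueError("need at least one problem pair")

    n_original = sum(r.original_correct for r in results)
    n_modified = sum(r.modified_correct for r in results)
    n_broken = sum(r.original_correct and not r.modified_correct for r in results)

    return {
        "accuracy_original_pct": 100 * n_original / n,
        "accuracy_modified_pct": 100 * n_modified / n,
        # Net change in accuracy, in percentage points.
        "accuracy_drop_pp": 100 * (n_original - n_modified) / n,
        # Share of pairs that flipped from correct to incorrect.
        "break_rate_pct": 100 * n_broken / n,
    }


# Toy data (invented for illustration): 7 pairs correct on both versions,
# 2 pairs that break, and 1 pair that is only correct after modification.
toy_results = (
    [PairResult(True, True)] * 7
    + [PairResult(True, False)] * 2
    + [PairResult(False, True)] * 1
)
summary = robustness_summary(toy_results)
for name, value in summary.items():
    print(f"{name}: {value:.1f}")
```

On this toy data the break rate (20%) is larger than the accuracy drop (10 percentage points), because one problem flips in the opposite direction and partly offsets the breaks. Under the assumed definition, the same effect would explain how a method can break on 1.8% of problems while losing only 1.3 percentage points of accuracy.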
Developers whose mathematical tasks must tolerate minor input variations should consider Chain-of-Thought prompting, even in preference to code execution, while keeping in mind that the measured differences were small and not statistically significant.