arXiv cs.AI · Wednesday · May 27, 2026

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

llms · math · reasoning · code execution · robustness

A study published on arXiv examines how robust Large Language Models (LLMs) are at mathematical reasoning when a problem is slightly varied, for example by changing names or numbers. The researchers compared three approaches: pure reasoning via Chain-of-Thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution with Step-by-Step Coding (SBSC). Each method was tested on 1,000 pairs of original and modified problems drawn from the GSM-Symbolic dataset, with all evaluations run on Claude Haiku 4.5.

CoT was the most robust method: its accuracy dropped by 1.3 percentage points under perturbation, and it broke on 1.8% of problems. PAL was the least robust, with a 1.7 percentage point drop in accuracy and failures on 3.1% of modified problems. SBSC fell between the two. The differences in robustness were not statistically significant (p = .096), so the results are suggestive at most: direct natural-language reasoning may be slightly more stable against minor problem variations than current code-execution methods.
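The two robustness figures quoted above, accuracy drop and break rate, can be computed from paired per-problem outcomes. The sketch below is illustrative only and is not taken from the paper: the `PairedResult` record and the `robustness_metrics` function are hypothetical names, and it assumes that a problem "breaks" when the model answers the original correctly but the modified version incorrectly. The study may define the term differently.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class PairedResult:
    """Outcome for one original/modified problem pair (hypothetical record type)."""
    original_correct: bool
    modified_correct: bool


def robustness_metrics(results: Sequence[PairedResult]) -> Dict[str, float]:
    """Return accuracy drop (percentage points) and break rate (percent).

    Assumption: a pair counts as "broken" when the original is answered
    correctly and the modified version is not.
    """
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    acc_original = sum(r.original_correct for r in results) / n
    acc_modified = sum(r.modified_correct for r in results) / n
    broken = sum(r.original_correct and not r.modified_correct for r in results)
    return {
        "accuracy_drop_pp": 100.0 * (acc_original - acc_modified),
        "break_rate_pct": 100.0 * broken / n,
    }


# Toy data (made up, not from the study): 8 pairs solved both times,
# 1 pair that breaks, and 1 pair solved only in its modified form.
toy = (
    [PairedResult(True, True)] * 8
    + [PairedResult(True, False), PairedResult(False, True)]
)
metrics = robustness_metrics(toy)
print(metrics)
```

In the toy data the net accuracy drop is zero while the break rate is 10%, because one newly solved problem offsets one broken problem. Under the assumed definition, this is how a method's break rate can exceed its accuracy drop, as it does for both CoT and PAL in the figures above.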

// why it matters

Developers may want to consider Chain-of-Thought prompting, even over code execution, for mathematical tasks that must tolerate minor input variations, keeping in mind that the measured differences were small and not statistically significant (p = .096).

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org

