What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation
In a paper posted to arXiv (2605.26795), researchers investigated why chain-of-thought (CoT) prompting improves language model accuracy. They focused on probe-time effects, meaning how a rationale already present in the input affects the model's answer, rather than on how rationales are generated. They found that even a globally word-shuffled rationale substantially outperforms a no-rationale baseline, which indicates a strong lexical activation effect. The additional gain from intact text comes less from sentence-level logical ordering than from short-range token adjacency: preserving contiguous windows of just 2-3 tokens recovers most of the remaining gap to full CoT performance. Further experiments ruled out three candidate explanations as primary drivers: copying of explicit answer declarations, copying of answer values, and full grammatical realization. The pattern held across multiple model families, parameter scales, and datasets. Together, these results suggest that CoT's probe-time benefit comes largely from local co-occurrence statistics rather than global derivation, implying that simpler prompting strategies may be as effective as full CoT in many cases.
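The perturbations described above can be illustrated with a small sketch. It contrasts a global shuffle, which destroys all word order, with a window shuffle, which keeps contiguous k-token windows intact and only shuffles their order. It also measures how many of the original adjacent token pairs (bigrams) survive. This is not the paper's code. Whitespace tokenization, non-overlapping fixed-size windows, and the helper names `global_shuffle`, `window_shuffle`, and `bigram_overlap` are all illustrative assumptions. The summary does not specify the exact tokenizer or perturbation procedure the authors used.

```python
import random
from typing import List


def global_shuffle(tokens: List[str], seed: int = 0) -> List[str]:
    """Shuffle every token independently, destroying all word order.

    Hypothetical helper: approximates a 'globally word-shuffled rationale'.
    """
    rng = random.Random(seed)
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


def window_shuffle(tokens: List[str], k: int, seed: int = 0) -> List[str]:
    """Split tokens into non-overlapping k-token windows and shuffle only their order.

    Adjacency inside each window is preserved; order across windows is destroyed.
    Hypothetical helper: one plausible way to 'preserve contiguous windows of k tokens'.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    rng = random.Random(seed)
    windows = [tokens[i:i + k] for i in range(0, len(tokens), k)]
    rng.shuffle(windows)
    return [tok for window in windows for tok in window]


def bigram_overlap(original: List[str], perturbed: List[str]) -> float:
    """Fraction of the original's adjacent token pairs that also occur in the perturbed text."""
    orig_pairs = list(zip(original, original[1:]))
    if not orig_pairs:
        return 1.0
    kept = set(zip(perturbed, perturbed[1:]))
    return sum(pair in kept for pair in orig_pairs) / len(orig_pairs)


if __name__ == "__main__":
    # Assumption: whitespace splitting stands in for real model tokenization.
    rationale = "There are 3 boxes with 4 apples each so 3 times 4 is 12 apples".split()
    for name, variant in [
        ("global shuffle", global_shuffle(rationale)),
        ("window shuffle, k=3", window_shuffle(rationale, k=3)),
    ]:
        print(f"{name}: {' '.join(variant)}")
        print(f"  bigram overlap: {bigram_overlap(rationale, variant):.2f}")
```

Both variants contain exactly the same words as the original, so any difference between them isolates the contribution of word order. With k=3, every pair inside a window is guaranteed to survive. For the 15-token example, that means at least 10 of its 14 original bigrams are kept, whereas a global shuffle typically keeps few or none. The exact printed values depend on the random seed.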
For developers, this suggests that prompt optimization may pay off more from attention to local token patterns than from carefully constructed full logical chains.