arXiv cs.AIWednesday · May 27, 2026FREE

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

llm-evaluationreasoningbenchmarkspost-training

A paper on arXiv (2605.26789) introduces composition collapse, a phenomenon where language models fail to assemble stably-known facts into multi-hop reasoning chains, even when aggregate benchmark scores suggest strong performance. The authors show that post-training recipes with statistically indistinguishable atomic knowledge can produce composition behavior separated by over 40 percentage points. They propose a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access. This decomposes post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2–11 across four post-training recipes, the decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask. The paper suggests that claims about multi-hop reasoning improvement should be accompanied by evidence of stable atomic access and residual composition gains.

// why it matters

Aggregate benchmarks can hide reasoning failures; developers need finer-grained evaluation to trust model composition.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org

Like this? Get the next digest.

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning — aigest.dev