arXiv cs.AIWednesday · May 27, 2026FREE

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

llm-evaluationreasoningbenchmarkspost-training

A paper on arXiv (2605.26789) introduces composition collapse, a phenomenon where language models fail to assemble stably-known facts into multi-hop reasoning chains, even when aggregate benchmark scores suggest strong performance. The authors show that post-training recipes with statistically indistinguishable atomic knowledge can produce composition behavior separated by over 40 percentage points. They propose a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access. This decomposes post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2–11 across four post-training recipes, the decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask. The paper suggests that claims about multi-hop reasoning improvement should be accompanied by evidence of stable atomic access and residual composition gains.

// why it matters

Aggregate benchmarks can hide reasoning failures; developers need finer-grained evaluation to trust model composition.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability:Separating Calibration from Ranking Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

Sources

Related

Like this? Get the next digest.