SPACENUM: Revisiting Spatial Numerical Understanding in VLMs
A new study posted to arXiv (cs.AI) introduces SpaceNum, a unified framework for evaluating spatial numerical understanding in Vision-Language Models (VLMs). The framework covers two settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. It defines two bidirectional tasks, Num2Space and Space2Num, which test the mapping between vision-side spatial structure and language-side numerical representations. In both settings, models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Error analysis, reasoning-trace analysis, and controlled interventions indicate that current VLMs rely heavily on shallow spatial cues rather than genuine numerical understanding. This has direct implications for deploying VLMs in embodied environments, where they must produce numerical outputs such as action magnitudes and spatial coordinates.
The findings suggest that current VLMs should not be relied on for numerical spatial tasks in robotics or autonomous systems without independent verification.
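
To make "close to random guessing" concrete, here is a minimal sketch of how one might check whether a model's accuracy is distinguishable from chance. It assumes a multiple-choice format with a fixed number of answer options per question, which the summary above does not confirm. The function names and the example counts are hypothetical and are not taken from the study.

```python
from math import comb


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing (assumed multiple-choice format)."""
    return 1.0 / num_options


def binomial_p_value(correct: int, total: int, p_chance: float) -> float:
    """One-sided exact p-value: P(X >= correct) for X ~ Binomial(total, p_chance)."""
    return sum(
        comb(total, k) * p_chance**k * (1 - p_chance) ** (total - k)
        for k in range(correct, total + 1)
    )


def above_chance(correct: int, total: int, num_options: int, alpha: float = 0.05) -> bool:
    """True if the observed accuracy is significantly better than random guessing."""
    return binomial_p_value(correct, total, chance_accuracy(num_options)) < alpha


if __name__ == "__main__":
    # Illustrative counts only (hypothetical, not results from the study):
    # 27 correct out of 100 four-option questions, against a 25% chance level.
    print(above_chance(correct=27, total=100, num_options=4))
```

An exact binomial test is used here because it needs only the standard library and stays valid for small question sets. A benchmark whose questions differ in their number of options would need a per-question chance level instead.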