arXiv cs.AI · Monday · May 25, 2026

Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

vision-language-models · benchmarks · evaluation · hallucination

A new arXiv paper (2605.22903) systematically investigates whether vision-language models (VLMs) actually use visual evidence when answering benchmark questions. The starting observation was that removing a substantial fraction of image tokens only slightly degraded model performance on a widely used hallucination benchmark. The authors then ran experiments at multiple levels: global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses that go beyond standard accuracy. A layer-wise analysis of vision-token geometry complements the behavioral results. The findings show that VLMs do incorporate visual input, but their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy metrics suggest. Even when the final prediction is unchanged, internal support for the correct answer may already be weakened. The study spans several open-source VLMs and highlights a mismatch between benchmark scores and true visual understanding.
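To make the "unchanged prediction, weaker support" point concrete, here is a minimal sketch of a decision-level check that looks past accuracy. It is not the authors' code: the helper names (`drop_tokens`, `decision_stats`), the random token-dropping scheme, and the example logits are all hypothetical, chosen only to illustrate how a probability margin can shrink while the predicted answer stays the same.

```python
import math
import random


def drop_tokens(tokens, keep_fraction, seed=0):
    """Randomly keep a fraction of image tokens, preserving their order.

    Hypothetical ablation: the paper's actual removal scheme may differ.
    """
    rng = random.Random(seed)
    n_keep = max(1, round(len(tokens) * keep_fraction))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decision_stats(logits, correct_idx):
    """Return whether the prediction is correct and the probability margin
    between the correct answer and its strongest competitor."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    best_other = max(p for i, p in enumerate(probs) if i != correct_idx)
    return {
        "correct": pred == correct_idx,
        "margin": probs[correct_idx] - best_other,
    }


# Made-up yes/no logits for one question, with the full image and
# after dropping most image tokens (illustrative values, not real results).
full = decision_stats([2.0, 0.0], correct_idx=0)
ablated = decision_stats([0.4, 0.0], correct_idx=0)

print(full["correct"], ablated["correct"])      # both answers still correct
print(ablated["margin"] < full["margin"])       # but the margin has shrunk
```

In this toy case accuracy alone would report no change between the two conditions, while the margin shows that the model's support for the correct answer has eroded. That is the kind of gap the summary above describes.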

// why it matters

Developers relying on benchmark scores may overestimate VLM visual grounding, leading to brittle real-world performance.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org

