CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials
Identifying Miller indices from powder XRD patterns is a demanding task for multimodal benchmarks: a model must first read narrow peak positions accurately from a rendered scientific curve and then apply multi-step crystallographic reasoning. CrystalXRD-Bench, introduced in a paper published on arXiv cs.AI on 2026-05-29, comprises 250 samples derived from 10 public crystallographic databases. Each sample pairs a rendered XRD image with its source CIF text and chemical formula, which makes it possible to separate visual-extraction errors from reasoning errors. The core task is to recover the full set of HKLs contributing to the highest-intensity peak in an XRD pattern. Seven vision-language models were evaluated; GPT-5.4 performed best, with a Jaccard score of 0.5888 and an exact-match rate of 37.6%. The other six models scored below a Jaccard of 0.50, which underlines the difficulty of the task. Three systematic error patterns were observed: models are particularly brittle on double-peak cases, recall-heavy models over-predict HKLs to gain coverage, and access to CIF text does not significantly improve the accuracy of crystallographic calculations.
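To make the two reported metrics concrete, the following is a minimal sketch of how a Jaccard score and an exact-match check could be computed over sets of HKLs. It is an illustration under stated assumptions, not the benchmark's own scoring code: the function names, the tuple representation of an HKL, and the handling of empty sets are all hypothetical, and the summary does not say whether symmetry-equivalent reflections are canonicalized before comparison.

```python
from typing import Iterable, Tuple

# Assumption: one reflection is represented as an (h, k, l) integer tuple.
HKL = Tuple[int, int, int]


def jaccard(predicted: Iterable[HKL], reference: Iterable[HKL]) -> float:
    """Size of the intersection divided by size of the union of two HKL sets."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        # Assumption: two empty sets count as a perfect match.
        return 1.0
    return len(pred & ref) / len(pred | ref)


def exact_match(predicted: Iterable[HKL], reference: Iterable[HKL]) -> bool:
    """True only when the predicted set equals the reference set exactly."""
    return set(predicted) == set(reference)


def recall(predicted: Iterable[HKL], reference: Iterable[HKL]) -> float:
    """Fraction of reference HKLs that appear in the prediction."""
    pred, ref = set(predicted), set(reference)
    return len(pred & ref) / len(ref) if ref else 1.0


# Made-up example values, not taken from the benchmark.
reference = {(1, 1, 1), (1, 1, -1)}
over_prediction = {(1, 1, 1), (1, 1, -1), (2, 0, 0), (2, 2, 0)}

print(jaccard(over_prediction, reference))      # 2 shared / 4 in union
print(recall(over_prediction, reference))       # both reference HKLs found
print(exact_match(over_prediction, reference))  # extra HKLs break the match
```

In this made-up case the prediction covers every reference HKL, so recall is perfect, but the two extra reflections halve the Jaccard score and fail the exact-match check. That is the kind of trade-off the recall-heavy over-prediction pattern describes.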
Developers building AI for scientific domains face two linked challenges: getting models to read complex visual data precisely, and getting them to apply multi-step, domain-specific reasoning to what they have read.