arXiv cs.AI · Tuesday · May 26, 2026 · FREE

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

llm · testing · reasoning · evaluation

LGMT (Logic-Grounded Metamorphic Testing) is an oracle-free framework that uses first-order logic (FOL) to evaluate the reasoning reliability of large language models. Unlike static benchmarks, which score each answer in isolation, LGMT derives metamorphic relations from formal logical equivalences and uses them to construct semantically invariant test cases: rewritten versions of a problem whose correct answer cannot change. By checking whether a model's answers stay consistent across these cases, it detects reasoning defects that reference-based evaluations miss. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects, particularly under symbol-level and conclusion-level variations. Advanced prompting techniques such as few-shot chain-of-thought (CoT) only partially mitigate these issues. The paper argues that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance, and presents LGMT as a principled and scalable approach for diagnosing reasoning reliability.
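To make the idea of a consistency check under logical equivalence concrete, here is a minimal sketch. It is not the paper's implementation. It assumes a propositional fragment instead of full FOL, and every name in it (`contrapositive`, `rename`, `brittle_model`, `inconsistent_variants`) is hypothetical. The toy "model" is a deliberately brittle pattern matcher standing in for an LLM call.

```python
from itertools import product

# Formulas are nested tuples (an assumed encoding, not the paper's):
# ("var", "p"), ("not", f), ("and", f, g), ("or", f, g), ("implies", f, g)

def evaluate(f, env):
    """Evaluate a formula under a truth assignment."""
    op = f[0]
    if op == "var":
        return env[f[1]]
    if op == "not":
        return not evaluate(f[1], env)
    if op == "and":
        return evaluate(f[1], env) and evaluate(f[2], env)
    if op == "or":
        return evaluate(f[1], env) or evaluate(f[2], env)
    if op == "implies":
        return (not evaluate(f[1], env)) or evaluate(f[2], env)
    raise ValueError(f"unknown operator: {op}")

def variables(f):
    """Collect the variable names used in a formula."""
    if f[0] == "var":
        return {f[1]}
    return set().union(*(variables(sub) for sub in f[1:]))

def equivalent(f, g):
    """Truth-table check that a rewrite really preserves meaning."""
    names = sorted(variables(f) | variables(g))
    return all(
        evaluate(f, dict(zip(names, vals))) == evaluate(g, dict(zip(names, vals)))
        for vals in product([False, True], repeat=len(names))
    )

def contrapositive(f):
    """Rewrite (a -> b) as (not b -> not a)."""
    assert f[0] == "implies"
    return ("implies", ("not", f[2]), ("not", f[1]))

def rename(f, mapping):
    """Symbol-level variation: consistently rename variables."""
    if f[0] == "var":
        return ("var", mapping.get(f[1], f[1]))
    return (f[0],) + tuple(rename(sub, mapping) for sub in f[1:])

def brittle_model(case):
    """Hypothetical stand-in for an LLM: recognizes only literal modus ponens."""
    premises, conclusion = case
    return any(
        f[0] == "implies" and f[2] == conclusion and f[1] in premises
        for f in premises
    )

def inconsistent_variants(model, source, variants):
    """Oracle-free check: flag variants whose answer differs from the source."""
    base = model(source)
    return [v for v in variants if model(v) != base]

# Source case: from p and (p -> q), does q follow?
rule = ("implies", ("var", "p"), ("var", "q"))
source = ((("var", "p"), rule), ("var", "q"))

# Variant 1: swap the rule for its logically equivalent contrapositive.
variant_contra = ((("var", "p"), contrapositive(rule)), ("var", "q"))

# Variant 2: rename symbols throughout the case.
mapping = {"p": "a", "q": "b"}
variant_renamed = (
    tuple(rename(f, mapping) for f in source[0]),
    rename(source[1], mapping),
)

failures = inconsistent_variants(
    brittle_model, source, [variant_contra, variant_renamed]
)
print(len(failures))  # 1: only the contrapositive variant changes the answer
```

No reference answer is consulted anywhere in the check. The only requirement is that answers agree across cases that are equivalent by construction, which is what makes the approach oracle-free. In a real setup, `brittle_model` would be replaced by a function that renders each case as a prompt and parses the model's verdict.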

// why it matters

Developers can use LGMT to uncover hidden reasoning flaws in LLMs that static benchmarks miss.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
