LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs
LGMT (Logic-Grounded Metamorphic Testing) is an oracle-free framework that uses first-order logic (FOL) to evaluate the reasoning reliability of large language models (LLMs). Unlike static benchmarks, which assess each answer in isolation, LGMT derives metamorphic relations from formal logical equivalences to construct semantically invariant test cases: variants of a problem whose correct answer must not change. By checking whether a model's answers agree across these variants, it detects reasoning defects that traditional reference-based evaluations miss. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects, particularly under symbol-level and conclusion-level variations. Advanced prompting techniques such as Few-shot CoT only partially mitigate these issues. The paper argues that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance, and presents LGMT as a principled and scalable approach for diagnosing reasoning reliability.
Developers can use LGMT to uncover hidden reasoning flaws in LLMs that static benchmarks miss.
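The consistency check described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `Case` type, the two transformations (a consistent symbol renaming and a double negation of the conclusion), and the stub models are all hypothetical names chosen for illustration, and a real setup would replace the stubs with calls to an actual LLM. The point is only that no reference answer is needed; a defect is flagged whenever the verdict on a logically equivalent variant differs from the verdict on the source case.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Case:
    """A reasoning problem: premises plus a conclusion to judge (hypothetical type)."""
    premises: Tuple[str, ...]
    conclusion: str


def rename_symbols(case: Case, mapping: Dict[str, str]) -> Case:
    """Symbol-level variant: rename predicates consistently; the logic is unchanged."""
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b")

    def sub(text: str) -> str:
        # One pass over the text, so renamings cannot chain into each other.
        return pattern.sub(lambda m: mapping[m.group(1)], text)

    return Case(tuple(sub(p) for p in case.premises), sub(case.conclusion))


def double_negate_conclusion(case: Case) -> Case:
    """Conclusion-level variant: ~~C is equivalent to C in classical logic."""
    return Case(case.premises, f"~~({case.conclusion})")


def build_variants(case: Case) -> Dict[str, Case]:
    """Apply each illustrative metamorphic transformation to the source case."""
    return {
        "symbol_rename": rename_symbols(case, {"Bird": "P", "Flies": "Q"}),
        "double_negation": double_negate_conclusion(case),
    }


Model = Callable[[Case], str]  # returns a verdict such as "True" / "False" / "Unknown"


def find_inconsistencies(
    model: Model, source: Case, variants: Dict[str, Case]
) -> List[Tuple[str, str]]:
    """Return (variant name, verdict) for every variant that disagrees with the source."""
    baseline = model(source)
    violations = []
    for name, variant in variants.items():
        verdict = model(variant)
        if verdict != baseline:
            violations.append((name, verdict))
    return violations


# Stub models standing in for an LLM (assumptions, for demonstration only).
def stable_model(case: Case) -> str:
    return "True"


def brittle_model(case: Case) -> str:
    # Simulates a model that is thrown off by a double-negated conclusion.
    return "Unknown" if case.conclusion.startswith("~~") else "True"


source = Case(
    premises=("forall x. Bird(x) -> Flies(x)", "Bird(tweety)"),
    conclusion="Flies(tweety)",
)
variants = build_variants(source)

if __name__ == "__main__":
    for model in (stable_model, brittle_model):
        print(model.__name__, find_inconsistencies(model, source, variants))
```

Note that the check compares verdicts only against each other, never against a labeled answer, which is what makes the approach oracle-free. A model that is consistently wrong would pass this particular check, so cross-case consistency complements reference-based evaluation rather than replacing it.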