arXiv cs.AI · Wednesday · May 27, 2026

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

multimodal · benchmark · education · evaluation

LiveK12Bench, introduced in a new arXiv paper, addresses the limitations of static benchmarks with a continuously updated set of over 2,000 verified questions in mathematics, physics, chemistry, and biology. The questions are sourced from the latest real-world exam papers, and an automated pipeline ingests and parses new exams to mitigate data contamination. A novel 'Mock Exam' evaluation scheme assesses models' end-to-end examination abilities. The benchmark is designed to grow over time so that it stays relevant as curricula evolve. The work highlights the gap between how current large multimodal models (LMMs) perform on static benchmarks and how they perform in real-world exam scenarios, and argues for more robust evaluation.

// why it matters

Developers building AI tutors need benchmarks that reflect real exam conditions to avoid overestimating model capabilities.

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org

