LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
LiveK12Bench, introduced in a new arXiv paper, addresses the limitations of static benchmarks with a continuously updated set of more than 2,000 verified questions in mathematics, physics, chemistry, and biology. The questions are drawn from the latest real-world exam papers, and an automated pipeline ingests and parses new exams to mitigate data contamination. A novel 'Mock Exam' evaluation scheme assesses a model's end-to-end examination ability. The benchmark is designed to grow over time so that it remains relevant as curricula evolve. The work highlights the gap between the performance of current large multimodal models (LMMs) on static benchmarks and their performance in real-world exam scenarios, and it argues for more robust evaluation.
Developers building AI tutors need benchmarks that reflect real exam conditions to avoid overestimating model capabilities.
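
For developers who want to apply the same two ideas to their own evaluation data, here is a minimal Python sketch. It is not the paper's pipeline or its Mock Exam scheme, neither of which is specified in this summary. It assumes that contamination is reduced by keeping only questions from exams held after a model's training cutoff, and that a mock exam is scored per paper as points earned over points available. The Question record, its fields, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Question:
    """Hypothetical record for one graded exam question."""
    exam_id: str      # which exam paper the question belongs to
    exam_date: date   # when that exam was held
    points: float     # points available (assumed positive)
    correct: bool     # whether the model's answer was graded correct


def fresh_questions(questions, training_cutoff):
    """Keep only questions from exams held after the training cutoff."""
    return [q for q in questions if q.exam_date > training_cutoff]


def mock_exam_scores(questions):
    """Return a percentage score per exam paper, weighted by points."""
    earned, total = {}, {}
    for q in questions:
        total[q.exam_id] = total.get(q.exam_id, 0.0) + q.points
        if q.correct:
            earned[q.exam_id] = earned.get(q.exam_id, 0.0) + q.points
    return {e: 100.0 * earned.get(e, 0.0) / total[e] for e in total}


# Made-up sample data, for illustration only.
questions = [
    Question("math-2023", date(2023, 6, 7), 5, True),
    Question("math-2025", date(2025, 6, 7), 5, True),
    Question("math-2025", date(2025, 6, 7), 10, False),
    Question("physics-2025", date(2025, 6, 8), 4, True),
    Question("physics-2025", date(2025, 6, 8), 4, True),
]

cutoff = date(2024, 12, 31)  # assumed training cutoff of the model under test
fresh = fresh_questions(questions, cutoff)
scores = mock_exam_scores(fresh)
print(scores)
```

Scoring whole papers rather than pooling all questions mirrors how a student is actually graded, and the date filter drops any exam the model could have seen during training.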