DEV Community · Tuesday, May 19, 2026 · FREE

I built an open-source LLM eval framework as a BCA student — hallucination detection, red-teaming, regression tracking

llm · evaluation · open-source · hallucination-detection

Ayush Khati, a BCA student from Siliguri, India, released an open-source LLM evaluation framework on GitHub. The framework runs a 27-test suite that assesses factual accuracy, safety refusals, hallucination resistance, adversarial prompts, and reasoning. It scores outputs using a three-tier judge chain: semantic similarity, an LLM judge, and a regex fallback. The system auto-generates adversarial prompt attacks for red-teaming any endpoint and tracks regressions across model versions. A live dashboard shows pass/fail rates and per-test inspection. The hallucination scorer hit 86% classification accuracy versus a 50% random baseline on a 50-case benchmark. The tech stack includes Flask, SQLAlchemy, Groq SDK, PostgreSQL, Next.js, and Framer Motion. The entire deployment runs on free tiers: Render (backend), Vercel (frontend), Neon (database), and Upstash (caching). A live demo is available at https://llm-eval-silk.vercel.app/, and the API health endpoint is at https://llm-eval-55pg.onrender.com/api/health. The project is hosted on GitHub at https://github.com/AyushkhatiDev/llm-eval, with a research note at https://github.com/AyushkhatiDev/llm-eval/blob/main/FINDINGS.md.
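To make the three-tier judge chain concrete, here is a minimal Python sketch of how such a chain could be wired. It is an illustration under stated assumptions, not the project's actual code. The function names, the 0.85 threshold, and the escalation rule (pass on high similarity, otherwise ask the LLM judge, otherwise fall back to regex) are all hypothetical. The similarity tier uses the standard library's `difflib` as a stand-in for real semantic similarity, and the LLM judge is any callable you supply.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Optional, Tuple

# Hypothetical threshold; the source does not state the real value.
PASS_THRESHOLD = 0.85


def similarity(a: str, b: str) -> float:
    """Stand-in for semantic similarity: a character-level ratio from difflib.
    A real system would more likely compare embeddings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def judge(
    output: str,
    expected: str,
    pattern: str,
    llm_judge: Optional[Callable[[str, str], bool]] = None,
) -> Tuple[bool, str]:
    """Return (passed, tier_used) for one model output."""
    # Tier 1: accept immediately when the output is close to the reference.
    if similarity(output, expected) >= PASS_THRESHOLD:
        return True, "similarity"

    # Tier 2: ask the LLM judge, if one is configured and it responds.
    if llm_judge is not None:
        try:
            return bool(llm_judge(output, expected)), "llm_judge"
        except Exception:
            pass  # judge unavailable: fall through to the regex tier

    # Tier 3: regex fallback on the raw output.
    return re.search(pattern, output, re.IGNORECASE) is not None, "regex"


if __name__ == "__main__":
    print(judge("The capital of France is Paris.", "Paris", r"\bparis\b"))
```

Returning the tier name alongside the verdict makes it possible to see, per test, which scorer decided the result, which is useful when inspecting individual failures.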

// why it matters

Provides a free, open-source tool for LLM evaluation and red-teaming.

Sources

Primary · DEV Community

