Today's digest · Monday, June 1

The 11 things in AI/dev today.

Live · Next issue at 7:00 CET
#1 / TODAY
arXiv cs.AI · 1 min · 38h ago · FREE

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

Researchers introduced PReMISE, a framework designed to improve the reliability and robustness of LLM judges by treating rubrics as measurement specifications. PReMISE discovers policy-level rubrics and audits existing ones across four axes, including structural adequacy and adversarial robustness. The framework found that no raw rubric source is simultaneously reliable, preference-predictive, and robust, highlighting the need for structured evaluation. PReMISE's repair operations can raise judge accuracy on paired responses from 65.0% to 68.6%.

Developers can leverage PReMISE to create more reliable and robust LLM evaluation systems, leading to better model development and deployment.

llm · judges · evaluation · rubrics · benchmarking
arxiv.org
#2 / TOP STORY
arXiv cs.AI · FREE

MAVEN: Improving Generalization in Agentic Tool Calling

MAVEN (Modular Agentic Verification and Execution Network) is a lightweight symbolic reasoning scaffold that improves generalization in tool-calling for LLMs. It boosts GPT-OSS-120b accuracy from 48% to 71% on the new MAVEN-Bench without additional training, and competes with proprietary models using an open-weight backbone.

#3 / TOP STORY
arXiv cs.AI · FREE

LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories

Researchers introduced LinTree, a method to improve LLM reasoning by explicitly structuring search histories. Initial findings show that raw access to the search history alone does not reliably outperform heuristic search, but making the underlying search tree explicit significantly improves performance. The approach helps LLMs make better use of their intermediate reasoning traces on complex tasks such as Blocks World, grid navigation, and Sokoban.

// Today · 8 stories

Developers can more easily extend Datasette's functionality with plugins and integrate its data into diverse analytical pipelines through new export formats.

datasette · python · data · sqlite
Simon Willison · 43h ago · 1m · FREE
// Yesterday · 26 stories