Today's digest · Monday, June 1

The 11 things in AI/dev today.

Live · Next issue at 7:00 CET

Stories

#1 / TODAY

arXiv cs.AI · 1 min · 38h ago · FREE

PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges

Researchers introduced PReMISE, a framework designed to improve the reliability and robustness of LLM judges by treating rubrics as measurement specifications. PReMISE discovers policy-level rubrics and audits existing ones across four axes, including structural adequacy and adversarial robustness. The framework found that no raw rubric source is simultaneously reliable, preference-predictive, and robust, highlighting the need for structured evaluation. PReMISE's repair operations can raise judge accuracy on paired responses from 65.0% to 68.6%.

Developers can leverage PReMISE to create more reliable and robust LLM evaluation systems, leading to better model development and deployment.

llm · judges · evaluation · rubrics · benchmarking

#2 / TOP STORY

arXiv cs.AI · FREE

MAVEN: Improving Generalization in Agentic Tool Calling

MAVEN (Modular Agentic Verification and Execution Network) is a lightweight symbolic reasoning scaffold that improves generalization in LLM tool calling. Without additional training, it raises GPT-OSS-120b accuracy from 48% to 71% on the new MAVEN-Bench, making an open-weight backbone competitive with proprietary models.

#3 / TOP STORY

arXiv cs.AI · FREE

LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories

Researchers introduced LinTree, a method that improves LLM reasoning by explicitly structuring search histories. Initial findings show that raw access to search history alone does not reliably outperform heuristic search, but making the underlying search tree explicit significantly improves performance. The approach helps LLMs make better use of their intermediate reasoning traces on complex tasks such as Blocks World, grid navigation, and Sokoban.

// Today · 8 stories

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Developers building data analysis agents must prioritize state management over interaction steps.

benchmark · agents · data-analysis · long-horizon

arXiv cs.AI · 38h ago · 1 min · FREE

Exploring Autonomous Agentic Data Engineering for Model Specialization

Enables LLMs to autonomously curate domain-specific data, reducing human effort in model specialization.

llm · data-engineering · agents · specialization

arXiv cs.AI · 38h ago · 1 min · FREE

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

Enables developers to build web agents that autonomously improve without costly expert data.

web-agents · mllm · self-improvement · exploration

arXiv cs.AI · 38h ago · 1 min · FREE

A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI

Developers must account for persona drift when using simulated evaluators for pluralistic AI alignment.

alignment · evaluation · personas · generative-ai

arXiv cs.AI · 38h ago · 1 min · FREE

COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents

Developers can deploy safer search agents without sacrificing utility or requiring large training datasets.

llm · safety · agents · alignment

arXiv cs.AI · 38h ago · 1 min · FREE

CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models

Provides a dynamic, multi-language benchmark for evaluating LLM code conciseness against human performance.

benchmark · llm · code-generation · code-golf

arXiv cs.AI · 38h ago · 1 min · FREE

NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

Developers must account for benchmark memorization when evaluating LLMs on numeric tasks.

llm · benchmark · memorization · evaluation

arXiv cs.AI · 38h ago · 1 min · FREE

datasette 1.0a32

Developers can more easily extend Datasette's functionality with plugins and integrate its data into diverse analytical pipelines through new export formats.

datasette · python · data · sqlite

Simon Willison · 43h ago · 1 min · FREE

// Yesterday · 26 stories