Today's digest · Thursday, May 28

The 43 things in AI/dev today.

Live · Next issue at 7:00 CET

Stories

#1 / TODAY

arXiv cs.AI · 2 min · 6d ago · FREE

Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

arXiv paper 2605.27628 introduces a theory of "managed autonomy" for agentic AI systems, addressing failures of unbounded autonomy, where agents keep operating despite rising uncertainty. The SMARt model, a four-layer framework (Stable, Meta-cognitive, Assisted, Regulated states), instantiates this theory. Using a timed, guarded Petri net formulation, the paper establishes theoretically bounded properties and shows how architectural design can formally mandate escalation, constrain invalid outputs, and ensure governance reachability for safer, more reliable AI agents.

Developers gain a formal framework to design agentic AI systems with built-in mechanisms for failure detection, recovery, and controlled surrender, enhancing reliability and safety.

aiagentsautonomygovernanceaisafety

arxiv.org

#2 / TOP STORY

arXiv cs.AI · FREE

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Researchers prove that LLMs cannot reliably perform causal discovery due to a fundamental limitation in distinguishing causal graphs from observational data. They propose A-CBO, which uses a frozen LLM as an interventional oracle with an external Bayesian loop to achieve efficient causal graph identification.

#3 / TOP STORY

arXiv cs.AI · FREE

Laguna M.1/XS.2 Technical Report

Poolside has introduced Laguna M.1 and XS.2, two Mixture-of-Experts foundation models designed for long-horizon, agentic coding. M.1 features 225.8 billion total parameters, while XS.2 has 33.4 billion. Both models were trained using an internal "Model Factory" system. Laguna XS.2's weights are now openly available under the Apache 2.0 license on Hugging Face, offering developers a new competitive option for agentic software engineering tasks and terminal benchmarks. This release provides a smaller, capable model for integration into developer workflows.

aigest · daily

// Today · 40 stories

Reasoning and Planning with Dynamically Changing Norms

Developers can build AI agents that adapt to evolving social norms, leading to safer and more context-aware human-AI interactions.

aiagentsnormshumanaiinteractionplanning

arXiv cs.AI · 6d ago · 2m · FREE

Cross-Entropy Games and Frost Training

Frost Training offers a faster way to improve LLM output quality for judge tasks.

llmtrainingoptimizationarxiv

arXiv cs.AI · 6d ago · 1m · FREE

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

Enables reliable agentic LLM deployment on resource-constrained devices without costly fine-tuning.

llmagentsfine-tuningbayesian-optimization

arXiv cs.AI · 6d ago · 1m · FREE

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

Developers can leverage SBBT to build more reliable LLM applications by gaining real-time insights into the trustworthiness of reasoning steps.

llmreliabilitybayesianinferencereasoningcalibration

arXiv cs.AI · 6d ago · 1m · FREE

Auditable Decision Models with Learned Abstention and Real-Time Steering

Enables auditable deferral of uncertain AI decisions to human review.

decision-controluncertaintyauditabilitytransformer

arXiv cs.AI · 6d ago · 1m · FREE

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

Reveals that LLM agents often face conflicting instructions, undermining reliability in policy-governed deployments.

llm-agentspolicy-conflictssafetyevaluation

arXiv cs.AI · 6d ago · 1m · FREE

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

Developers evaluating RAG systems must adopt cluster-aware inference to avoid overstating progress.

ragllm-as-a-judgeevaluationmulti-hop

arXiv cs.AI · 6d ago · 1m · FREE

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

Developers cannot rely on standard safety benchmarks to predict real-world model behavior.

safetyalignmentllmevaluation

arXiv cs.AI · 6d ago · 1m · FREE

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

Enables LLM agents to autonomously perform tasks without external skill prompts, improving efficiency in long-horizon RL.

llm-agentsreinforcement-learningskill-internalizationarxiv

arXiv cs.AI · 6d ago · 1m · FREE

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

Developers building multimodal reasoning systems can reduce hallucinations by explicitly optimizing chain-of-thought reasoning.

multimodalreasoninghallucinationdpo

arXiv cs.AI · 6d ago · 1m · FREE

SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

Enables context-aware suicide risk detection in group chats, improving prevention tools for developers.

suicide-preventionnlpbenchmarkgroup-chat

arXiv cs.AI · 6d ago · 1m · FREE

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

Enables systematic study of how harness configurations affect agent performance.

llm-agentsbenchmarkharnessevaluation

arXiv cs.AI · 6d ago · 1m · FREE

Dr-CiK: A Testbed for Foresight-Driven Agents

Highlights critical gap in agent ability to autonomously find relevant context for forecasting.

agentsbenchmarkforecastingcontext-retrieval

arXiv cs.AI · 6d ago · 1m · FREE

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

Enables flexible, theory-agnostic value alignment for autonomous systems.

llmethicsalignmentvalues

arXiv cs.AI · 6d ago · 1m · FREE

Soro: A Lightweight Foundation Model and Chatbot for Tajik

Developers gain access to specialized, resource-efficient LLMs and benchmarks for low-resource languages, enabling broader application development.

llmfoundationmodeltajiklowresource

arXiv cs.AI · 6d ago · 1m · FREE

On the Origin of Synthetic Information by Means of Steganographic Inheritance

Developers could use such a mechanism to establish clear provenance for AI-generated assets, enhancing accountability and trust in synthetic content.

aisteganographyprovenancesyntheticinformation

arXiv cs.AI · 6d ago · 1m · FREE

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Developers can use DynaSchedBench to rigorously test and improve AI-driven scheduling agents, ensuring robust performance in dynamic industrial environments.

llm-agentsschedulingbenchmarkingai-optimization

arXiv cs.AI · 6d ago · 1m · FREE

LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

Enables parallel LLM generations to collaborate, improving accuracy without major architectural changes.

llmpositional-encodingparallel-generationreasoning

arXiv cs.AI · 6d ago · 1m · FREE

Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

Enables autonomous, real-time data exploration without manual querying, reducing time-to-insight for developers.

agentsreal-time-analyticsllmstream-processing

arXiv cs.AI · 6d ago · 1m · FREE

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

Agyn provides a standardized, secure infrastructure for deploying AI agents at scale.

ai-agentskubernetesopen-sourcedevops

arXiv cs.AI · 6d ago · 1m · FREE

Voluntary Collusion with Secret Tools in Competing LLM Agents

Developers must anticipate that LLM agents may collude against system goals, undermining safety in multi-agent deployments.

llm-agentsmulti-agentalignmentcollusion

arXiv cs.AI · 6d ago · 1m · FREE

Behavioural Analysis of Alignment Faking

Developers must account for alignment faking in model training and deployment to ensure safety.

alignment-fakingai-safetyarxiv

arXiv cs.AI · 6d ago · 1m · FREE

DeepSciVerify: Verifying Scientific Claim-Citation Alignment via LLM-Driven Evidence Escalation

Improves reliability of LLM-generated scientific reports by verifying claim-citation alignment efficiently.

llmscientific-verificationclaim-citationbenchmark

arXiv cs.AI · 6d ago · 1m · FREE

A Policy-Driven Runtime Layer for Agentic LLM Serving

Reduces serving costs by enabling efficient, agent-aware caching and policy execution.

llm-servingagentscachingarchitecture

arXiv cs.AI · 6d ago · 1m · FREE

Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

Developers must carefully specify measurement protocols when evaluating LLM confidence calibration.

llmcalibrationconfidencearxiv

arXiv cs.AI · 6d ago · 1m · FREE

SkillGrad: Optimizing Agent Skills Like Gradient Descent

Automates skill refinement without manual tuning, improving agent reliability.

agentsllmoptimizationskills

arXiv cs.AI · 6d ago · 1m · FREE

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Developers must account for social contagion effects when deploying multi-agent systems with sensitive data.

llmagentsprivacysafety

arXiv cs.AI · 6d ago · 1m · FREE

A Query Engine for the Agents

Enables AI agents to query their own traces and logs directly in the browser.

agentsjavascriptquery-engineopen-source

arXiv cs.AI · 6d ago · 1m · FREE

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Enables rigorous evaluation of AI agents that must perceive, reason, and interact in real-world tool-use scenarios.

benchmarkagentsmultimodalegocentric

arXiv cs.AI · 6d ago · 1m · FREE

TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

Enables more efficient multi-agent systems by jointly optimizing prompts and communication structures.

multi-agentprompt-optimizationtopologyco-evolution

arXiv cs.AI · 6d ago · 1m · FREE

MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

Enables LLMs to reason about molecules at a chemically meaningful level, improving automated drug design.

llmagentsmolecular-designdrug-discovery

arXiv cs.AI · 6d ago · 1m · FREE

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Enables transparent, verifiable AI-assisted investment research with human oversight.

llmagentsfinanceknowledge-graph

arXiv cs.AI · 6d ago · 1m · FREE

AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

Enables non-expert scientists to build high-performing AI models without specialized engineering skills.

agentsai-modelsscientific-discoveryknowledge-system

arXiv cs.AI · 6d ago · 1m · FREE

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

Reduces risk of LLMs amplifying unreliable explanations, improving trust in AI systems.

xaillmfaithfulnessbenchmark

arXiv cs.AI · 6d ago · 1m · FREE

A Unified Framework for the Evaluation of LLM Agentic Capabilities

Enables fairer, more reliable comparison of LLM agent capabilities across benchmarks.

llmagentsevaluationbenchmark

arXiv cs.AI · 6d ago · 1m · FREE

sqlite AGENTS.md

This tool simplifies building privacy-preserving, local-first AI agents by leveraging SQLite for data storage and offering multi-LLM support.

sqliteagentsllmsprivacy

Simon Willison · 6d ago · 1m · FREE

ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

Developers building AI agents for IT automation must address significant accuracy gaps before deployment.

benchmarkagentsenterpriseIT

Hugging Face · 6d ago · 1m · FREE

I think Anthropic and OpenAI have found product-market fit

Developers can now confidently build core applications on Anthropic and OpenAI models, knowing they address critical market needs.

anthropicopenaiproductmarketfitllms

Simon Willison · 6d ago · 2m · FREE

Quoting Kyle Ferrana

This architectural perspective could simplify AI system design, foster innovation through component reuse, and improve the maintainability of complex AI applications for developers.

aiarchitecturemodularityopensourceaidataengineering

Simon Willison · 7d ago · 1m · FREE

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

Enables practical shipping and fine-tuning of trillion-parameter models.

huggingfacetrldelta-synclarge-models

Hugging Face · 7d ago · 1m · FREE

// Yesterday · 41 stories