Today's digest · Wednesday, May 27

The 41 things in AI/dev today.

Live · Next issue at 7:00 CET

Stories

3 top · 38 rest

#1 / TODAY

arXiv cs.AI · 1 min · 39h ago · FREE

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

ScientistOne is an autonomous research system that uses Chain-of-Evidence (CoE) to make every claim traceable to its source. In tests across 75 papers, it produced zero hallucinated references, while baselines hallucinated up to 21% of references and passed score verification in as few as 42% of papers.

Enables trustworthy autonomous research by eliminating hallucinated references and ensuring verifiable claims.

autonomous-research · verifiability · ai-agents · arxiv

arxiv.org

#2 / TOP STORY

arXiv cs.AI · FREE

PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

Researchers introduced PolyFusionAgent, an autonomous AI assistant that combines a multimodal foundation model (PolyFusion) with a tool-augmented design agent (PolyAgent) for polymer discovery. PolyFusion aligns diverse polymer representations to predict thermophysical properties and generate novel structures. PolyAgent integrates literature retrieval to evaluate and contextualize designs. The system aims to overcome fragmented data and, by providing actionable design decisions, accelerate the development of new materials for fields like energy storage and biomedicine.

#3 / TOP STORY

arXiv cs.AI · FREE

Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

Researchers introduced a new evaluation problem for legal AI, highlighting that current LLMs struggle to distinguish legally relevant changes from irrelevant ones. Their unified evaluation suite revealed existing models are systematically sensitive to legally immaterial variations. To address this, they developed LexGuard, an adversarial multi-agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints and uses SMT solvers to verify legal satisfaction, significantly improving legal reasoning reliability and reducing vulnerability to manipulative framing.

// Today · 38 stories

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

Developers should consider Chain-of-Thought prompting for mathematical tasks requiring high robustness against minor input variations, even over code execution.

llms · math · reasoning · codeexecution

arXiv cs.AI · 39h ago · 1m · FREE

AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

Enables efficient, lossless prompt compression for LLM agents without inference overhead.

llm-agents · prompt-compression · arxiv

arXiv cs.AI · 39h ago · 1m · FREE

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

Enables fully on-device mobile GUI agents, reducing privacy risks and latency.

mobile-gui-agents · on-device-inference · vlm · exploration

arXiv cs.AI · 39h ago · 1m · FREE

Experiments in Agentic AI for Science

Enables autonomous AI agents to handle complex scientific data curation and analysis tasks.

agentic-ai · scientific-workflows · llm · rag

arXiv cs.AI · 39h ago · 1m · FREE

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

Anchor enables reliable, scalable evaluation of AI agents for enterprise automation.

agents · benchmark · enterprise · constraint-optimization

arXiv cs.AI · 39h ago · 1m · FREE

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Developers gain a more rigorous method to test if LLMs truly understand mental states, not just mimic answers.

llm · theory-of-mind · benchmark · social-reasoning

arXiv cs.AI · 39h ago · 1m · FREE

JobBench: Aligning Agent Work With Human Will

Shifts AI agent evaluation from economic replacement to human-centered delegation.

agents · benchmark · ai-safety · arxiv

arXiv cs.AI · 39h ago · 1m · FREE

Automatic Layer Selection for Hallucination Detection

Enables reliable hallucination detection without manual layer tuning, reducing engineering overhead.

hallucination-detection · llm · layer-selection · arxiv

arXiv cs.AI · 39h ago · 1m · FREE

Advancing Creative Physical Intelligence in Large Multimodal Models

Highlights a critical gap in LMMs' ability to sustain grounded reasoning for creative problem-solving.

multimodal-models · benchmark · creative-reasoning · ai-research

arXiv cs.AI · 39h ago · 1m · FREE

Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems

Developers applying policy gradient methods to complex, long-term decision problems can use these insights to diagnose and mitigate issues related to task completion and optimal performance.

policy-gradient · reinforcement-learning · long-horizon · cumulative-damage

arXiv cs.AI · 39h ago · 2m · FREE

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Developers can now better understand and improve LLM agent planning for code generation tasks.

llm-agents · cuda · kernel-generation · feedback

arXiv cs.AI · 39h ago · 1m · FREE

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

Developers must tune agent harness structure per model reasoning type, not just capability tier.

llm-agents · harness-design · benchmarking · arxiv

arXiv cs.AI · 39h ago · 1m · FREE

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Developers must consider CoT as an attack surface for jailbreaking reasoning models.

chain-of-thought · refusal · safety · reasoning-models

arXiv cs.AI · 39h ago · 1m · FREE

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

Developers can optimize prompts by focusing on local token patterns rather than full logical chains.

chain-of-thought · prompting · llm · reasoning

arXiv cs.AI · 39h ago · 1m · FREE

Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

Developers building LLM systems with diverse user groups can achieve more stable and fair outputs.

llm · alignment · multi-stakeholder · arxiv

arXiv cs.AI · 39h ago · 1m · FREE

TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

Enables automated detection of defects in AI-generated peer reviews, improving review quality at conferences.

llm · peer-review · agents · ai-safety

arXiv cs.AI · 39h ago · 1m · FREE

From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

Enables reliable, traceable automated legal analysis for developers building compliance or policy tools.

rag · legal-ai · agents · llm

arXiv cs.AI · 39h ago · 1m · FREE

Generating Robust Portfolios of Optimization Models using Large Language Models

Enables more reliable automated optimization model generation from natural language.

llm · optimization · portfolio · arxiv

arXiv cs.AI · 39h ago · 1m · FREE

ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis

Makes causal analysis accessible to domain experts without requiring deep methodological knowledge.

causal-analysis · copilot · agents · root-cause-analysis

arXiv cs.AI · 39h ago · 1m · FREE

Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

Enables better general capability recovery in domain-specialized LLMs without requiring teacher-aligned prompts.

llm · distillation · domain-specialization · on-policy

arXiv cs.AI · 39h ago · 1m · FREE

Position: AI Safety Requires Effective Controllability

Developers must prioritize controllability alongside alignment to ensure AI systems can be safely managed in real-world deployments.

ai-safety · controllability · alignment · agents

arXiv cs.AI · 39h ago · 1m · FREE

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

Enables systematic RL optimization of LLM multi-agent workflows beyond manual prompting.

reinforcement-learning · multi-agent · llm · framework

arXiv cs.AI · 39h ago · 1m · FREE

Can LLMs Introspect? A Reality Check

Developers relying on LLM introspection for debugging or alignment may need more robust methods.

llm · introspection · metacognition · arxiv

arXiv cs.AI · 39h ago · 1m · FREE

From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

Improves real-world reliability of conversational AI by reducing compounding errors over multiple turns.

reinforcement-learning · dialogue · llm · distribution-shift

arXiv cs.AI · 39h ago · 1m · FREE

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

Enables targeted debugging of LLM memory failures for more reliable long-horizon agents.

llm · memory · benchmark · agents

arXiv cs.AI · 39h ago · 1m · FREE

The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

Developers cannot trust RAG outputs without internal verification, as models may rely on memory instead of retrieved context.

rag · attribution · ai-safety · evaluation

arXiv cs.AI · 39h ago · 1m · FREE

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

Developers building AI tutors need benchmarks that reflect real exam conditions to avoid overestimating model capabilities.

multimodal · benchmark · education · evaluation

arXiv cs.AI · 39h ago · 1m · FREE

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

Aggregate benchmarks can hide reasoning failures; developers need finer-grained evaluation to trust model composition.

llm-evaluation · reasoning · benchmarks · post-training

arXiv cs.AI · 39h ago · 1m · FREE

Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs

Enables automated, trustworthy multi-hop reasoning for complex supply chain queries.

llm · multi-agent · knowledge-graph · supply-chain

arXiv cs.AI · 39h ago · 1m · FREE

Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)

Enables safer LLM deployment in regulated industries by combining formal guarantees with neural detection.

llm · verification · neuro-symbolic · hallucination

arXiv cs.AI · 39h ago · 1m · FREE

LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

Developers can now perform entity linking without domain-specific training, reducing setup time and improving portability.

entity-linking · ner · llm · python-library

arXiv cs.AI · 39h ago · 1m · FREE

[AINews] New AI Infra decacorns: Fireworks, Baseten (with OpenRouter on the way)

AI infrastructure companies are becoming the new power brokers, shaping access and cost of model inference.

ai-infrastructure · funding · decacorn · inference

Latent Space · 39h ago · 1m · FREE

Build highly scalable serverless LangGraph multi-agent systems in AWS with Amazon Bedrock AgentCore

Simplifies building and scaling complex multi-agent AI systems on AWS.

aws · langgraph · bedrock · agents

AWS ML Blog · 2d ago · 1m · FREE

Build high-performance generative AI systems with Strands Agents, NVIDIA NIM, and Amazon Bedrock AgentCore

Enables scalable, observable multi-agent AI systems with GPU acceleration and managed orchestration.

aws · nvidia · agents · bedrock

AWS ML Blog · 2d ago · 1m · FREE

AgentWatch: Proactive AWS monitoring with ambient agents

AgentWatch reduces manual monitoring overhead by automating infrastructure checks and enabling natural language queries.

aws · monitoring · agents · devops

AWS ML Blog · 2d ago · 1m · FREE

From idea to AI app: Creating intelligent research assistants with Strands

Strands cuts AI app development from months to days, lowering the barrier for building autonomous research assistants.

aws · strands · ai-assistants · framework

AWS ML Blog · 2d ago · 1m · FREE

Build an enterprise observability solution for Amazon Quick

Enables developers to build and monitor AI platforms with centralized usage insights.

aws · observability · amazon-quick · enterprise

AWS ML Blog · 2d ago · 1m · FREE

Transforming professional work: How Amazon Quick turns document creation from hours into minutes

Developers can integrate Quick's API to automate document generation in their apps, saving hours of manual work.

aws · ai · document-generation · productivity

AWS ML Blog · 2d ago · 1m · FREE

// Yesterday · 38 stories