arXiv cs.AI · Thursday · May 28, 2026

A Policy-Driven Runtime Layer for Agentic LLM Serving

llm-serving · agents · caching · architecture

A new arXiv paper (2605.27744) argues that multi-agent LLM systems suffer from a mismatch between agent frameworks and serving engines, leading to ad-hoc patches for policies like prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, and safety enforcement. The authors propose inserting an agent runtime layer between the two, exposing four primitives: observe, score, predict, act. This layer uses agent identity as a shared coordinate, enabling any agent-aware policy to be plugged in. They map nine concrete policies onto the layer and validate the abstraction with CacheSage, a policy that learns per-agent KV caching across sessions to reduce serving costs. The paper claims this architectural change addresses the seam between framework and engine more effectively than point fixes.
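
To make the four primitives concrete, here is a minimal sketch of what a pluggable policy keyed on agent identity might look like. The summary above does not give the paper's actual interface, so every name here (`Event`, `Policy`, `ToyRetentionPolicy`, `prefix_key`, the `threshold` parameter) is a hypothetical stand-in, and the toy hit-rate heuristic is not CacheSage.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Event:
    """One serving observation. All field names are assumptions."""
    agent_id: str     # agent identity: the shared coordinate across layers
    prefix_key: str   # hypothetical identifier for a prompt prefix
    cache_hit: bool   # whether the engine reused cached KV state


class Policy(Protocol):
    """Hypothetical shape of an agent-aware policy: observe, score, predict, act."""
    def observe(self, event: Event) -> None: ...
    def score(self, agent_id: str) -> float: ...
    def predict(self, agent_id: str) -> bool: ...
    def act(self, agent_id: str) -> str: ...


class ToyRetentionPolicy:
    """Illustrative only: retain KV entries for agents whose prefixes get reused."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self.hits: dict[str, int] = defaultdict(int)
        self.total: dict[str, int] = defaultdict(int)

    def observe(self, event: Event) -> None:
        # Accumulate per-agent statistics from engine-side events.
        self.total[event.agent_id] += 1
        if event.cache_hit:
            self.hits[event.agent_id] += 1

    def score(self, agent_id: str) -> float:
        # Observed hit rate for this agent; 0.0 if it has never been seen.
        n = self.total[agent_id]
        return self.hits[agent_id] / n if n else 0.0

    def predict(self, agent_id: str) -> bool:
        # Naive forecast: past reuse at or above the threshold implies future reuse.
        return self.score(agent_id) >= self.threshold

    def act(self, agent_id: str) -> str:
        # Emit a decision the serving engine could apply.
        return "retain" if self.predict(agent_id) else "evict"
```

In this sketch the runtime would call `observe` on every request, then consult `act(agent_id)` when the engine has to decide whether to keep an agent's cached state. Because all four methods take or carry `agent_id`, a different policy (fairness, batch shaping, and so on) could implement the same four methods without touching the framework or the engine.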

// why it matters

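If the abstraction holds, agent-aware policies such as per-agent KV caching can be plugged into one shared layer rather than patched separately into the framework or the engine, which is where the paper expects the serving-cost savings to come from.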

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
