A Policy-Driven Runtime Layer for Agentic LLM Serving
A new arXiv paper (2605.27744) argues that multi-agent LLM systems suffer from a mismatch between agent frameworks and serving engines, leading to ad-hoc patches for policies like prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, and safety enforcement. The authors propose inserting an agent runtime layer between the two, exposing four primitives: observe, score, predict, act. This layer uses agent identity as a shared coordinate, enabling any agent-aware policy to be plugged in. They map nine concrete policies onto the layer and validate the abstraction with CacheSage, a policy that learns per-agent KV caching across sessions to reduce serving costs. The paper claims this architectural change addresses the seam between framework and engine more effectively than point fixes.
The claimed benefit is lower serving cost, achieved through agent-aware caching and policy execution.
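To make the four primitives concrete, here is a minimal sketch of how a policy keyed on agent identity might plug into such a layer. It is not the paper's API and it is not CacheSage: the `Event`, `Policy`, and `PrefixReusePolicy` names, the method signatures, the `retain_kv`/`evict_kv` action strings, and the 0.5 threshold are all assumptions made for illustration. Only the primitive names (observe, score, predict, act) and the use of agent identity as the shared key come from the summary above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Event:
    """One request seen at the framework/engine seam (hypothetical shape)."""
    agent_id: str     # agent identity, the shared coordinate
    prefix_hash: str  # stand-in for the prompt prefix of the request


class Policy(Protocol):
    """Hypothetical plug-in interface built around the four primitives."""
    def observe(self, event: Event) -> None: ...
    def score(self, agent_id: str) -> float: ...
    def predict(self, agent_id: str) -> bool: ...
    def act(self, agent_id: str) -> str: ...


class PrefixReusePolicy:
    """Toy policy: keep KV cache for agents that tend to repeat a prefix."""

    def __init__(self, threshold: float = 0.5) -> None:
        self.threshold = threshold
        self.seen: dict[str, set[str]] = defaultdict(set)
        self.hits: dict[str, int] = defaultdict(int)
        self.total: dict[str, int] = defaultdict(int)

    def observe(self, event: Event) -> None:
        # Record whether this agent has sent the same prefix before.
        self.total[event.agent_id] += 1
        if event.prefix_hash in self.seen[event.agent_id]:
            self.hits[event.agent_id] += 1
        self.seen[event.agent_id].add(event.prefix_hash)

    def score(self, agent_id: str) -> float:
        # Fraction of this agent's requests that repeated an earlier prefix.
        total = self.total[agent_id]
        return self.hits[agent_id] / total if total else 0.0

    def predict(self, agent_id: str) -> bool:
        # Guess that the next request will reuse a prefix if past reuse is high.
        return self.score(agent_id) >= self.threshold

    def act(self, agent_id: str) -> str:
        # Turn the prediction into a hint an engine could consume.
        return "retain_kv" if self.predict(agent_id) else "evict_kv"


policy = PrefixReusePolicy()
trace = [("planner", "p1"), ("planner", "p1"), ("planner", "p1"),
         ("coder", "a"), ("coder", "b")]
for agent, prefix in trace:
    policy.observe(Event(agent, prefix))

decisions = {agent: policy.act(agent) for agent in ("planner", "coder")}
print(decisions)
```

In this toy trace the planner agent repeats its prefix and gets a retain hint, while the coder agent never repeats one and gets an evict hint. The point of the sketch is the shape: because every method takes an agent identity, a fairness or memoization policy could in principle implement the same four methods and be swapped in without touching the framework or the engine.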