arXiv cs.AI · Monday · May 25, 2026

ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

llm · kv-cache · object-storage · inference

ObjectCache, detailed in a new arXiv paper (2605.22850), proposes storing LLM prefix KV caches in S3-compatible object storage instead of expensive remote DRAM pools. The system co-designs the storage protocol and the transfer schedule so that the storage server delivers KV cache data in the order the GPU consumes it, which lets data transfer overlap with compute across concurrent requests. The prototype runs on a 100 Gbps RoCE cluster using NIXL, Ceph RGW, and DAOS. For 64K contexts, ObjectCache adds only 5.6% latency over local DRAM; no figure is given here for 4K contexts. The approach lets serving clusters scale cache capacity without a proportional increase in cost, addressing a key bottleneck in LLM serving.
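To make the overlap idea concrete, here is a minimal timing sketch of layer-by-layer retrieval versus fetching the whole cache before computing. It is an illustrative model only, not the paper's scheduler: the function names, the assumption of uniform per-layer fetch and compute times, and the example numbers are all hypothetical.

```python
def serial_latency(num_layers: int, fetch_s: float, compute_s: float) -> float:
    """No overlap: every layer's KV data is fetched, then that layer computes."""
    return num_layers * (fetch_s + compute_s)


def layerwise_latency(num_layers: int, fetch_s: float, compute_s: float) -> float:
    """Two-stage pipeline: the next layer is fetched while the current one computes.

    Assumes fetches are issued back to back in layer order (the order the GPU
    consumes them) and that a layer can start computing only once its own KV
    data has arrived and the previous layer has finished.
    """
    fetch_done = 0.0    # time at which the current layer's KV data has arrived
    compute_done = 0.0  # time at which the current layer has finished computing
    for _ in range(num_layers):
        fetch_done += fetch_s
        compute_done = max(compute_done, fetch_done) + compute_s
    return compute_done


if __name__ == "__main__":
    # Hypothetical numbers: 4 layers, 1.0 s fetch and 2.0 s compute per layer.
    print(serial_latency(4, 1.0, 2.0))     # 12.0
    print(layerwise_latency(4, 1.0, 2.0))  # 9.0
```

In this model, when compute is slower than transfer, only the first layer's fetch is exposed and the rest is hidden behind compute; when transfer is slower, the pipeline is bound by the link instead.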

// why it matters

Reduces LLM serving costs by replacing expensive DRAM pools with cheap object storage.

Sources

Primary · arXiv cs.AI
▸ Read original at arxiv.org
