arXiv cs.AI · Monday · May 25, 2026 · FREE

Parallel Context Compaction for Long-Horizon LLM Agent Serving

llm-agents · context-compaction · inference-optimization

A new paper introduces parallel compaction for long-horizon LLM agent serving, targeting conversation histories that grow past the context window. Traditional sequential compaction relies on LLM-based summarization, which blocks agent inference for tens of seconds and offers no fine-grained control over summary volume, because models tend to neglect size instructions given in the prompt. Parallel compaction instead processes multiple context blocks concurrently, which lets operators set precise summary sizes and apply targeted prompt engineering per block. The method was evaluated on the HotpotQA and LoCoMo benchmarks using four backbones ranging from 8B to 120B parameters, covering dense and MoE architectures and both reasoning and non-reasoning models. Results show that parallel compaction matches the decode volume of sequential compaction while reducing latency and improving predictability. The approach is most relevant to long-horizon agents that accumulate extensive histories, such as those used in multi-turn dialogue or multi-hop QA.
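The core idea (split the history into blocks, summarize the blocks concurrently, cap each summary at a fixed size) can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's implementation: the function names, the fixed-size block split, and the word-based budget are all assumptions, and `truncating_summarizer` is a hypothetical stand-in for a real LLM call.

```python
import asyncio
from typing import Awaitable, Callable, List

# Hypothetical summarizer signature: (block_text, max_words) -> summary.
Summarizer = Callable[[str, int], Awaitable[str]]


def split_into_blocks(messages: List[str], block_size: int) -> List[List[str]]:
    """Group consecutive messages into blocks of block_size (the last may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [messages[i:i + block_size] for i in range(0, len(messages), block_size)]


async def truncating_summarizer(text: str, max_words: int) -> str:
    """Stand-in for an LLM call (assumption): keeps only the first max_words words."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return " ".join(text.split()[:max_words])


async def compact_parallel(
    messages: List[str],
    block_size: int,
    max_words_per_block: int,
    summarize: Summarizer = truncating_summarizer,
) -> List[str]:
    """Summarize every block concurrently, each under its own size budget."""
    blocks = split_into_blocks(messages, block_size)
    tasks = [summarize("\n".join(block), max_words_per_block) for block in blocks]
    # asyncio.gather returns results in the same order as the input awaitables,
    # so the summaries stay aligned with the original block order.
    return list(await asyncio.gather(*tasks))


history = [f"turn {i}: some observation text" for i in range(10)]
summaries = asyncio.run(
    compact_parallel(history, block_size=4, max_words_per_block=5)
)
```

In a real deployment the per-block budget would more plausibly be enforced as a decode-token limit on each summarization request rather than by truncating words; the digest does not say how the paper enforces it, so that detail is left open here.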

// why it matters

Reduces agent inference stalls and improves knowledge retention consistency for long-running LLM agents.

Sources

Primary · arXiv cs.AI

