arXiv cs.AI · Monday · May 25, 2026

Parallel Context Compaction for Long-Horizon LLM Agent Serving

llm-agents · context-compaction · inference-optimization

A new paper introduces parallel compaction for long-horizon LLM agent serving, targeting conversation histories that grow beyond the context window. Traditional sequential compaction relies on LLM-based summarization, which blocks agent inference for tens of seconds and gives no fine-grained control over summary volume, because the model tends to neglect prompt instructions. Parallel compaction processes multiple context blocks concurrently, letting operators set precise summary sizes and apply targeted prompt engineering to each block. The method was evaluated on the HotpotQA and LoCoMo benchmarks using four backbones from 8B to 120B parameters, covering dense and MoE architectures and both reasoning and non-reasoning models. Results show that parallel compaction matches the decode volume of sequential compaction while reducing latency and improving predictability. The approach is particularly relevant for long-horizon agents that accumulate extensive histories, such as those used in multi-turn dialogue or multi-hop QA tasks.
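To make the idea concrete, here is a minimal Python sketch of block-wise compaction run concurrently. It is an illustration under stated assumptions, not the paper's implementation: the function names (`split_into_blocks`, `compact_parallel`, `truncating_summarizer`), the fixed-size blocking, and the per-block token cap are all hypothetical, and the summarizer is a stand-in for a real LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical summarizer signature: (block_text, max_tokens) -> summary.
Summarizer = Callable[[str, int], str]


def split_into_blocks(turns: List[str], block_size: int) -> List[List[str]]:
    """Group consecutive conversation turns into fixed-size blocks."""
    return [turns[i:i + block_size] for i in range(0, len(turns), block_size)]


def compact_parallel(
    turns: List[str],
    summarize: Summarizer,
    block_size: int = 4,
    tokens_per_block: int = 64,
    max_workers: int = 8,
) -> List[str]:
    """Summarize every block concurrently, each under its own size budget.

    Assumption: a hard per-block budget is one way an operator could
    control total summary volume; the source does not specify the mechanism.
    """
    blocks = split_into_blocks(turns, block_size)
    if not blocks:
        return []
    workers = min(max_workers, len(blocks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, so the summaries
        # stay aligned with the chronological order of the blocks.
        return list(
            pool.map(lambda b: summarize("\n".join(b), tokens_per_block), blocks)
        )


def truncating_summarizer(text: str, max_tokens: int) -> str:
    """Stand-in for an LLM call: keeps the first max_tokens whitespace tokens."""
    return " ".join(text.split()[:max_tokens])


if __name__ == "__main__":
    history = [f"turn {i} says hello world" for i in range(10)]
    for summary in compact_parallel(history, truncating_summarizer,
                                    block_size=4, tokens_per_block=5):
        print(summary)
```

With a real model behind `summarize`, the blocks would be issued as independent requests, so wall-clock time would be bounded by the slowest block rather than by one long sequential summarization, and the total summary size would be at most the number of blocks times `tokens_per_block`.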

// why it matters

Reduces agent inference stalls and improves knowledge retention consistency for long-running LLM agents.

Sources

Primary · arXiv cs.AI

