arXiv cs.AI · Tuesday · June 2, 2026

Threshold-Based Exclusive Batching for LLM Inference

llm-inference batching gpu performance

A paper on arXiv (2606.00516) compares mixed batching (MB) with exclusive batching (EB) for LLM inference. MB interleaves prefill and decode in a single batch to maximize compute and memory utilization, but controlled experiments reveal that prefill-decode interference raises MB's per-step marginal cost above that of pure decode. On the high-bandwidth H200 (4.8 TB/s), this occurs only when decode tokens exceed 80% of the batch; on the bandwidth-constrained RTX PRO 6000 (1.792 TB/s), the threshold drops to 20%. The authors derive a closed-form condition for the EB-MB crossover, optimal phase-switching thresholds, and memory-safe batch sizing for EB. Optimized EB achieves up to 41.9% higher throughput on bandwidth-constrained GPUs, while MB retains its advantage on high-bandwidth hardware with larger models. A hybrid scheduler, EB+, dynamically switches between EB and MB based on that crossover condition. The study offers practical guidance for selecting a batching strategy based on GPU memory bandwidth, model size, and workload.
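
To make the switching idea concrete, here is a minimal Python sketch of a per-step mode selector. It is not the paper's closed-form crossover condition, which this summary does not reproduce. As a stand-in, it assumes the decode-token share of a batch is compared against the two per-GPU interference thresholds quoted above. The names `Batch`, `INTERFERENCE_THRESHOLD`, and `choose_mode` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Decode-share thresholds quoted in the summary above.
# Assumption: they are used here as a simple switching rule; the paper's
# actual EB+ condition is a closed-form expression not given in this digest.
INTERFERENCE_THRESHOLD = {
    "H200": 0.80,          # 4.8 TB/s memory bandwidth
    "RTX PRO 6000": 0.20,  # 1.792 TB/s memory bandwidth
}


@dataclass
class Batch:
    """Hypothetical view of one scheduling step's token counts."""
    prefill_tokens: int
    decode_tokens: int

    @property
    def decode_fraction(self) -> float:
        total = self.prefill_tokens + self.decode_tokens
        # An empty batch has no decode share; avoid dividing by zero.
        return self.decode_tokens / total if total else 0.0


def choose_mode(batch: Batch, gpu: str) -> str:
    """Return "EB" when the decode share exceeds the GPU's threshold, else "MB"."""
    threshold = INTERFERENCE_THRESHOLD[gpu]
    return "EB" if batch.decode_fraction > threshold else "MB"


if __name__ == "__main__":
    step = Batch(prefill_tokens=512, decode_tokens=512)  # 50% decode tokens
    for gpu in INTERFERENCE_THRESHOLD:
        print(gpu, choose_mode(step, gpu))
```

Under this simplified rule, a batch with a 50% decode share stays in MB on the H200 and switches to EB on the RTX PRO 6000, in line with the summary's point that lower memory bandwidth brings the interference penalty on much earlier.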

// why it matters

Helps developers choose the batching strategy that maximizes LLM inference throughput on a given GPU.

Sources

Primary · arXiv cs.AI
