Threshold-Based Exclusive Batching for LLM Inference
A paper on arXiv (2606.00516) compares mixed batching (MB) with exclusive batching (EB) for LLM inference. MB interleaves prefill and decode in a single batch to maximize compute and memory utilization, whereas EB runs the two phases in separate batches and switches between them. Controlled experiments show that prefill-decode interference can push MB's per-step marginal cost above that of pure decode. On the high-bandwidth H200 (4.8 TB/s), this happens only when decode tokens exceed 80% of the batch; on the bandwidth-constrained RTX PRO 6000 (1.792 TB/s), the threshold drops to 20%. The authors derive a closed-form condition for the EB-MB crossover, along with optimal phase-switching thresholds and memory-safe batch sizing for EB. Optimized EB achieves up to 41.9% higher throughput on bandwidth-constrained GPUs, while MB retains its advantage on high-bandwidth hardware with larger models. A hybrid scheduler, EB+, switches dynamically between EB and MB according to the crossover condition. The study gives practical guidance for selecting a batching strategy from GPU memory bandwidth, model size, and workload.
The results help developers choose the batching strategy that maximizes LLM inference throughput on a given GPU.
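
To make the decode-fraction thresholds concrete, here is a minimal Python sketch of a per-batch mode selector. It is a simplification built only on the two thresholds quoted above (80% on the H200, 20% on the RTX PRO 6000) and is not the paper's closed-form crossover condition or the actual EB+ scheduler, neither of which is reproduced in this summary. The names `Batch`, `INTERFERENCE_THRESHOLD`, and `choose_mode` are hypothetical.

```python
from dataclasses import dataclass

# Decode-token share above which MB's per-step marginal cost exceeded pure
# decode in the reported experiments. The dictionary keys are illustrative
# labels, not identifiers from the paper.
INTERFERENCE_THRESHOLD = {
    "H200": 0.80,          # 4.8 TB/s memory bandwidth
    "RTX PRO 6000": 0.20,  # 1.792 TB/s memory bandwidth
}


@dataclass
class Batch:
    prefill_tokens: int
    decode_tokens: int

    @property
    def decode_fraction(self) -> float:
        """Share of the batch's tokens that are decode tokens."""
        total = self.prefill_tokens + self.decode_tokens
        return self.decode_tokens / total if total else 0.0


def choose_mode(batch: Batch, gpu: str) -> str:
    """Return "EB" when the decode share exceeds the GPU's threshold, else "MB".

    Assumption: crossing the interference threshold is treated as the sole
    signal to prefer exclusive batching. The paper's condition also depends
    on model size and workload, which this sketch ignores.
    """
    threshold = INTERFERENCE_THRESHOLD[gpu]
    return "EB" if batch.decode_fraction > threshold else "MB"


if __name__ == "__main__":
    half_decode = Batch(prefill_tokens=500, decode_tokens=500)
    for gpu in INTERFERENCE_THRESHOLD:
        print(gpu, choose_mode(half_decode, gpu))
```

With a batch that is half decode tokens, this heuristic keeps mixed batching on the H200 and switches to exclusive batching on the RTX PRO 6000, which mirrors the direction of the reported results without claiming to reproduce their magnitudes.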