arXiv cs.AI · Saturday · May 23, 2026

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

llm · power-management · energy-efficiency · vllm

PALS, introduced in arXiv paper 2605.21427, is a power-aware runtime for LLM serving that integrates GPU power caps as a first-class control knob alongside software parameters like batch size. The system uses lightweight offline power-performance models and a feedback-driven controller to select configurations meeting throughput targets while maximizing energy efficiency. Implemented within the vLLM framework, PALS requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM serving systems to reduce energy consumption and improve reliability in data centers.
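The selection step described above can be illustrated with a small sketch. This is not the PALS implementation: the profile table, its numbers, and the names `Config`, `Prediction`, `select_config`, and `update_correction` are all hypothetical, and the feedback rule is a simple proportional correction chosen for illustration. It assumes an offline profile that maps each (power cap, batch size) pair to a predicted throughput and power draw, picks the most energy-efficient configuration that meets a throughput target within a power budget, and then adjusts its trust in the offline model from measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    power_cap_w: int  # per-GPU power cap in watts (hypothetical values)
    batch_size: int   # maximum batch size


@dataclass(frozen=True)
class Prediction:
    throughput_tps: float  # predicted tokens per second
    power_w: float         # predicted average power draw in watts


# Made-up offline profile; the numbers are illustrative, not from the paper.
PROFILE = {
    Config(200, 16): Prediction(900.0, 190.0),
    Config(200, 32): Prediction(1300.0, 198.0),
    Config(300, 16): Prediction(1100.0, 270.0),
    Config(300, 32): Prediction(1800.0, 290.0),
    Config(400, 32): Prediction(2000.0, 380.0),
}


def efficiency(pred):
    """Tokens per joule: (tokens/s) divided by (joules/s)."""
    return pred.throughput_tps / pred.power_w


def select_config(profile, target_tps, power_budget_w, correction=1.0):
    """Return the most efficient config that meets the throughput target
    within the power budget, or None if no config qualifies.

    `correction` scales predicted throughput to reflect how far
    measurements have drifted from the offline model.
    """
    feasible = [
        (cfg, pred)
        for cfg, pred in profile.items()
        if pred.throughput_tps * correction >= target_tps
        and pred.power_w <= power_budget_w
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda item: efficiency(item[1]))[0]


def update_correction(correction, predicted_tps, measured_tps, gain=0.5):
    """Move the correction factor part of the way toward the observed
    measured/predicted ratio (an assumed proportional feedback rule)."""
    observed_ratio = measured_tps / predicted_tps
    return correction + gain * (observed_ratio - correction)
```

With this toy profile, a 1200 tokens/s target under a 400 W budget selects the 200 W cap with batch size 32. If measured throughput then comes in at 1040 tokens/s against a predicted 1300, the correction factor drops to about 0.9, and the next selection moves to the 300 W cap because the lower cap no longer clears the target.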

// why it matters

Treating GPU power caps as a serving-time control knob improves energy efficiency by up to 26.3% and cuts QoS violations by 4x to 7x under power constraints, with no model retraining or API changes.

Sources

Primary · arXiv cs.AI
