PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
PALS, introduced in arXiv paper 2605.21427, is a power-aware runtime for LLM serving that integrates GPU power caps as a first-class control knob alongside software parameters like batch size. The system uses lightweight offline power-performance models and a feedback-driven controller to select configurations meeting throughput targets while maximizing energy efficiency. Implemented within the vLLM framework, PALS requires no model retraining or API changes. Across multi-GPU systems and both dense and mixture-of-experts (MoE) models, PALS improves energy efficiency by up to 26.3%, reduces QoS violations by 4x to 7x under power constraints, and tracks dynamic power budgets. These results highlight the potential of integrating power control directly into LLM serving systems to reduce energy consumption and improve reliability in data centers.
PALS improves energy efficiency by up to 26.3% and reduces QoS violations by 4x to 7x in LLM serving, without model retraining or API changes.
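The configuration selection described in the abstract can be illustrated with a minimal sketch. It is not the PALS implementation. It assumes an offline profile that maps each (power cap, batch size) pair to a predicted throughput and average power draw, picks the most energy-efficient entry that meets a throughput target within a power budget, and uses a simple feedback correction when measured throughput departs from the profile. The names `Config`, `PROFILE`, `select_config`, and `update_correction` are hypothetical, and every number in the table is made up for illustration rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Config:
    power_cap_w: int  # per-GPU power cap in watts (assumed knob)
    batch_size: int   # maximum batch size (assumed knob)


# Hypothetical offline profile: config -> (throughput in tokens/s, average power in W).
# Illustrative values only; these are not measurements from the paper.
PROFILE = {
    Config(200, 16): (900.0, 190.0),
    Config(200, 32): (1300.0, 198.0),
    Config(300, 16): (1100.0, 270.0),
    Config(300, 32): (1800.0, 290.0),
    Config(400, 32): (2000.0, 380.0),
}


def select_config(target_tps: float, power_budget_w: float,
                  correction: float = 1.0) -> Optional[Config]:
    """Return the most energy-efficient config that meets the throughput
    target and stays within the power budget, or None if none qualifies."""
    best, best_eff = None, -1.0
    for cfg, (profiled_tps, watts) in PROFILE.items():
        predicted_tps = profiled_tps * correction  # profile scaled by feedback
        if predicted_tps < target_tps or watts > power_budget_w:
            continue
        efficiency = predicted_tps / watts  # tokens per joule
        if efficiency > best_eff:
            best, best_eff = cfg, efficiency
    return best


def update_correction(correction: float, profiled_tps: float,
                      measured_tps: float, gain: float = 0.5) -> float:
    """Feedback step: move the correction factor toward the observed
    measured/profiled throughput ratio by a fraction `gain`."""
    observed_ratio = measured_tps / profiled_tps
    return correction + gain * (observed_ratio - correction)
```

With these illustrative numbers, a 1000 tokens/s target under a 400 W budget selects the 200 W, batch-size-32 entry, because it has the highest tokens-per-joule ratio among the entries that meet the target. If measured throughput then falls below the profile, the correction factor shrinks and a later call to `select_config` may move to a higher power cap.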