Cross-Entropy Games and Frost Training
Frost Training, introduced in arXiv:2605.27701, is a method for enhancing Monte Carlo-based policy optimization in Cross-Entropy Games, a family of LLM-as-a-judge tasks. The approach leverages the gradient of the reward function in embedding space, a signal previously used in the Greedy Coordinate Gradient (GCG) jailbreaking technique. According to the paper, this is the first demonstration that such gradients can also improve model training. The method was validated using GRPO training for maximum-likelihood infilling. The results show that Frost Training improves the model's ability to generate high-scoring outputs: it achieves higher maximum scores in a best-of-k setting, and it does so faster. The paper was published on arXiv on May 28, 2026.
In short, Frost Training offers a faster way to improve the quality of LLM outputs on LLM-as-a-judge tasks.
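
To make the GRPO and best-of-k terminology above concrete, here is a minimal sketch of the two quantities involved: group-relative advantages (each sampled completion's reward normalized against the other samples for the same prompt) and the best-of-k score (the maximum judge score among k samples). This is not the paper's implementation and does not include the embedding-space gradient signal that Frost Training adds. The function names, the use of the population standard deviation, the `eps` constant, and the example scores are all assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: center and scale rewards within one sampling group.

    Assumption: population standard deviation; implementations differ on this.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


def best_of_k(scores, k):
    """Maximum judge score among the first k sampled completions."""
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of scores")
    return max(scores[:k])


# Hypothetical judge scores for four completions of the same prompt.
scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(scores)
top_score = best_of_k(scores, k=4)
```

In this sketch, completions that score above the group mean receive positive advantages and are reinforced, while `best_of_k` reports the kind of maximum-score metric the summary refers to.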