arXiv cs.AI · Wednesday · May 27, 2026 · FREE

Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems

policy-gradient · reinforcement-learning · long-horizon · cumulative-damage · ppo

This arXiv paper, "Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems," published on May 27, 2026, examines why policy-gradient methods struggle in long-horizon decision problems with cumulative damage, where actions that look beneficial locally lead to globally detrimental outcomes over time. The authors separate two ways a policy-gradient agent can fall short. "Completion" asks whether the agent reaches the terminal horizon rather than exiting prematurely because of implicit constraints. "Optimality" asks how closely the agent's performance matches a dynamic-programming reference, given that it completes. The proposed decomposition evaluates the two separately.

In experiments with Proximal Policy Optimization (PPO) under a linear soft penalty, simply giving the agent access to the full horizon paradoxically reduced the completion rate, because the penalty's equilibrium drives the dominant-activity share to zero. Combining action-space restriction with horizon access improved completion but left an optimality gap of 0.271, which the authors attribute to greedy commitments made early in the decision process.

The findings were qualitatively replicated across two calibrated environments: a 49-step bricklayer career simulation and a 20-season NBA power-forward career simulation, which share the same abstract problem structure despite differing domains and data.
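The separation of completion from optimality can be illustrated with a small evaluation helper. This is a minimal sketch under stated assumptions, not the authors' code: the `Episode` record, the function names, the sample numbers, and the normalized form of the gap (reference return minus mean completed return, divided by the reference) are all hypothetical, since the summary does not say how the paper defines its 0.271 gap.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Episode:
    """One evaluation rollout (hypothetical record, not from the paper)."""
    steps_survived: int   # steps taken before the episode ended
    total_return: float   # undiscounted sum of rewards over the episode


def completion_rate(episodes: Sequence[Episode], horizon: int) -> float:
    """Fraction of episodes that reach the terminal horizon."""
    if not episodes:
        raise ValueError("need at least one episode")
    completed = sum(1 for e in episodes if e.steps_survived >= horizon)
    return completed / len(episodes)


def conditional_optimality_gap(
    episodes: Sequence[Episode],
    horizon: int,
    dp_reference_return: float,
) -> Optional[float]:
    """Normalized shortfall against the DP reference, over completed episodes only.

    Assumed definition: (reference - mean completed return) / |reference|.
    Returns None when no episode completes, because the gap is then undefined.
    """
    if dp_reference_return == 0:
        raise ValueError("reference return must be non-zero to normalize")
    completed = [e.total_return for e in episodes if e.steps_survived >= horizon]
    if not completed:
        return None
    mean_return = sum(completed) / len(completed)
    return (dp_reference_return - mean_return) / abs(dp_reference_return)


# Made-up rollouts for a 49-step environment; the values are illustrative only.
episodes = [
    Episode(steps_survived=49, total_return=80.0),
    Episode(steps_survived=49, total_return=70.0),
    Episode(steps_survived=30, total_return=40.0),  # exited early
    Episode(steps_survived=12, total_return=10.0),  # exited early
]
rate = completion_rate(episodes, horizon=49)
gap = conditional_optimality_gap(episodes, horizon=49, dp_reference_return=100.0)
print(f"completion rate: {rate:.2f}, conditional optimality gap: {gap:.2f}")
```

Reporting the two numbers separately keeps early exits from being mistaken for poor long-run decisions: an agent that rarely completes can still show a small conditional gap, and one that always completes can still show a large one.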

// why it matters

Developers applying policy gradient methods to complex, long-term decision problems can use these insights to diagnose and mitigate issues related to task completion and optimal performance.

Sources

Primary · arXiv cs.AI

