Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
This arXiv paper, published on May 27, 2026, addresses challenges that policy-gradient methods face in long-horizon decision problems with cumulative damage, where actions that look beneficial locally can produce globally detrimental outcomes over time. The research distinguishes two criteria on which policy-gradient algorithms can fail: "completion," meaning whether the agent reaches the terminal horizon rather than exiting prematurely due to implicit constraints, and "optimality," meaning how closely the agent's performance matches a dynamic-programming reference, given that completion is achieved. To analyze these issues, the authors propose a decomposition that separates completion from optimality. In their experiments with Proximal Policy Optimization (PPO) under a linear soft penalty, simply providing access to the full horizon paradoxically reduced the completion rate, because the penalty's equilibrium drives the dominant-activity share to zero. Combining action-space restriction with horizon access improved completion but introduced an optimality gap of 0.271, which the authors attribute to greedy commitments made early in the decision process. The findings were qualitatively replicated in two calibrated environments that share the same abstract problem structure despite differing domains and data: a 49-step bricklayer career simulation and a 20-season NBA power-forward career simulation.
Developers applying policy gradient methods to complex, long-term decision problems can use these insights to diagnose and mitigate issues related to task completion and optimal performance.
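As a diagnostic, the completion/optimality decomposition can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Episode` record, the function names, the sample numbers, and the relative form of the gap (reference minus mean completed return, divided by the reference) are all assumptions made here. The summary above does not state how the paper normalizes its reported gap.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Episode:
    """Hypothetical rollout record; field names are illustrative."""
    steps_survived: int   # steps taken before the episode ended
    total_return: float   # cumulative reward collected in the episode


def completion_rate(episodes: Sequence[Episode], horizon: int) -> float:
    """Fraction of episodes that reached the terminal horizon."""
    if not episodes:
        raise ValueError("need at least one episode")
    completed = sum(1 for e in episodes if e.steps_survived >= horizon)
    return completed / len(episodes)


def conditional_optimality_gap(
    episodes: Sequence[Episode], horizon: int, dp_reference: float
) -> Optional[float]:
    """Relative shortfall against a dynamic-programming reference value,
    computed only over episodes that completed the horizon.

    Assumption: gap = (reference - mean completed return) / |reference|.
    Returns None when no episode completed, since the gap is undefined.
    """
    if dp_reference == 0:
        raise ValueError("relative gap is undefined for a zero reference")
    returns = [e.total_return for e in episodes if e.steps_survived >= horizon]
    if not returns:
        return None
    mean_return = sum(returns) / len(returns)
    return (dp_reference - mean_return) / abs(dp_reference)


# Made-up rollouts on a 49-step horizon (the bricklayer horizon length);
# the returns and the reference value of 100.0 are invented for illustration.
rollouts = [
    Episode(steps_survived=49, total_return=80.0),
    Episode(steps_survived=49, total_return=70.0),
    Episode(steps_survived=30, total_return=40.0),  # exited early
    Episode(steps_survived=12, total_return=10.0),  # exited early
]

rate = completion_rate(rollouts, horizon=49)                              # 0.5
gap = conditional_optimality_gap(rollouts, horizon=49, dp_reference=100.0)  # 0.25
print(f"completion rate: {rate:.2f}, conditional gap: {gap:.3f}")
```

Reporting the two numbers separately keeps early exits from being folded into the return average: a policy that rarely completes and a policy that completes but commits greedily early would otherwise look similar in a single mean-return figure.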