Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum
The paper, published on arXiv (2605.20196v1), investigates the mechanism behind real-data scaling laws. The authors construct a suffix-automaton representation of text corpora and define a global-KL predictive contribution spectrum, where each state's contribution is its empirical mass times KL deviation from a global next-token baseline. Across 12 real corpora, the tail slope of this spectrum strongly correlates with the empirical data-scaling exponent of a fixed small GPT learner. For each training size N, they define an effective truncation rank K(N) by matching observed excess loss to the residual tail mass of the prepared 1000k global-KL spectrum. Empirically, log K is nearly linear in log N, with pooled R² ≈ 0.96 for the raw spectrum and R² ≈ 0.90 for the smoothed spectrum. These findings support a simple mechanism: training scale advances an effective frontier through a predictive state spectrum, with residual tail mass governing loss.
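The mechanism can be written compactly. The notation below is assumed for illustration and is not taken from the paper: c_s is the contribution of state s, c_(r) is the r-th largest contribution, T(K) is the residual tail mass beyond rank K, and ΔL(N) is the observed excess loss at training size N.

```latex
c_s = \hat{p}(s)\, D_{\mathrm{KL}}\!\left( \hat{P}(\cdot \mid s) \,\middle\|\, \hat{P}_{\mathrm{global}} \right),
\qquad
T(K) = \sum_{r > K} c_{(r)},
\qquad
T\bigl(K(N)\bigr) = \Delta L(N).

% Idealized case (an assumption): a pure power-law tail and a power-law frontier.
c_{(r)} \propto r^{-\alpha} \ (\alpha > 1)
\;\Longrightarrow\;
T(K) \approx \int_{K}^{\infty} r^{-\alpha}\, dr \propto K^{\,1-\alpha},
\qquad
K(N) \propto N^{\gamma}
\;\Longrightarrow\;
\Delta L(N) \propto N^{-\gamma(\alpha - 1)}.
```

Under these idealized assumptions, a tail slope α and a frontier exponent γ (the slope of log K against log N) together imply a data-scaling exponent of γ(α − 1). This is one way to see why the tail slope and the scaling exponent would be correlated.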
The work provides a principled framework for predicting scaling behavior and for optimizing data curation.
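The following is a minimal Python sketch of the quantities described above, not the authors' implementation. It makes several assumptions that are not in the source: fixed-length token contexts stand in for suffix-automaton states, the global baseline is the empirical next-token distribution over the same positions, and the effective rank is taken as the smallest K whose residual tail mass does not exceed the excess loss. All function names (`contribution_spectrum`, `residual_tail_mass`, `effective_rank`, `loglog_slope`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def contribution_spectrum(tokens, order=2):
    """Per-context contributions (empirical mass * KL to the global
    next-token baseline), sorted in descending order.

    Assumption: contexts are fixed-length windows, a simplified
    stand-in for suffix-automaton states.
    """
    n = len(tokens) - order
    if n <= 0:
        raise ValueError("corpus is shorter than the context order")

    global_counts = Counter()
    context_counts = defaultdict(Counter)
    for i in range(n):
        context = tuple(tokens[i:i + order])
        next_token = tokens[i + order]
        context_counts[context][next_token] += 1
        global_counts[next_token] += 1

    contributions = []
    for nexts in context_counts.values():
        total = sum(nexts.values())
        mass = total / n  # empirical mass of this context
        kl = sum(
            (c / total) * math.log((c / total) / (global_counts[t] / n))
            for t, c in nexts.items()
        )
        contributions.append(mass * kl)
    return sorted(contributions, reverse=True)


def residual_tail_mass(spectrum):
    """tail[K] is the summed contribution of all ranks beyond K."""
    tail = [0.0] * (len(spectrum) + 1)
    for k in range(len(spectrum) - 1, -1, -1):
        tail[k] = tail[k + 1] + spectrum[k]
    return tail


def effective_rank(spectrum, excess_loss):
    """Smallest K whose residual tail mass is at most excess_loss
    (an assumed matching rule)."""
    for k, mass in enumerate(residual_tail_mass(spectrum)):
        if mass <= excess_loss:
            return k
    return len(spectrum)


def loglog_slope(sizes, ranks):
    """Least-squares slope of log K against log N."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(k) for k in ranks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

In use, one would compute the spectrum once on the largest prepared corpus, call `effective_rank` with the measured excess loss at each training size N, and pass the resulting (N, K) pairs to `loglog_slope` to check how close log K is to linear in log N.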