arXiv cs.AI · Monday · June 1, 2026 · FREE

NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

llm · benchmark · memorization · evaluation

NumLeak is a measurement framework introduced in a new arXiv paper (2605.30393). It shows that public numeric benchmarks appear in pretraining data, so that evaluation measures memorized recall rather than out-of-sample skill. The authors tested frontier LLMs, including Claude and GPT-4, on financial data such as the Fama-French market excess return (Mkt-RF). The models reached pooled Pearson r=0.97-0.99, while staying within 0.15 on the within-25bps measure for sibling factors. Similar fidelity was observed on U.S. unemployment, CPI inflation, and NOAA temperature.

On a recent-release holdout, the parse rate collapsed to 21-57%, but r remained at approximately 0.99 on the months the models did answer. This refuse-or-recall asymmetry is consistent with memorization. A white-box experiment reproduced the dose-response, and logprob ranking detected memorization that open-ended generation misses, implying that closed-API black-box probes understate the channel.

In one example, a Sonnet "date to market-sentiment" regression that correlated with true Mkt-RF at r=0.74 collapsed to r=0.02 once the model's own recall was residualized out. A one-line system-prompt defense blocks 99% of the leakage. The paper was published on arXiv on June 1, 2026.
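The quantities named above (parse rate, Pearson r on answered months, a within-25bps share, and residualizing out recall) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's code: the function names, the regex-based parsing, the choice of percentage points as the unit (so 0.25 means 25 bps), and the use of a simple least-squares fit for residualization are all hypothetical choices made here.

```python
import math
import re


def parse_number(reply):
    """Return the first signed decimal in a reply, or None (treated as a refusal).

    Assumption: replies contain at most one number. A real harness would need
    stricter parsing, e.g. to avoid picking up a year mentioned in the reply.
    """
    match = re.search(r"[-+]?\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None


def pearson(xs, ys):
    """Plain Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def recall_metrics(replies, truth, tol=0.25):
    """Score model replies against published values.

    parse_rate : share of prompts that produced a parseable number
    pearson_r  : correlation with truth on answered prompts only
    within_tol : share of answered prompts within `tol` of truth
    """
    parsed = [(parse_number(r), t) for r, t in zip(replies, truth)]
    answered = [(p, t) for p, t in parsed if p is not None]
    preds = [p for p, _ in answered]
    trues = [t for _, t in answered]
    return {
        "parse_rate": len(answered) / len(parsed),
        "pearson_r": pearson(preds, trues),
        "within_tol": sum(abs(p - t) <= tol for p, t in answered) / len(answered),
    }


def residualize(signal, recall):
    """Remove the part of `signal` that a least-squares line on `recall` explains.

    Assumption: one regressor plus an intercept; the paper may use a
    different specification.
    """
    n = len(signal)
    ms, mr = sum(signal) / n, sum(recall) / n
    beta = sum((r - mr) * (s - ms) for r, s in zip(recall, signal)) / sum(
        (r - mr) ** 2 for r in recall
    )
    return [(s - ms) - beta * (r - mr) for s, r in zip(signal, recall)]
```

Under these assumptions, a refuse-or-recall pattern would show up as a low `parse_rate` alongside a high `pearson_r`. A sentiment-style signal whose correlation with the true series drops sharply after `residualize(signal, recall)` would suggest that the signal was mostly carrying the model's recalled value.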

// why it matters

Developers must account for benchmark memorization when evaluating LLMs on numeric tasks.

Sources

Primary · arXiv cs.AI

