arXiv cs.AI · Monday · May 25, 2026

Tensor Cache: Eviction-conditioned Associative Memory for Transformers

transformers · kv-cache · long-context · attention

Tensor Cache (arXiv:2605.22884) proposes a two-level cache for autoregressive Transformers. The first level (L1) is standard sliding-window softmax attention, which bounds memory but discards tokens once they leave the window. The second level (L2) is a fixed-size outer-product fast-weight memory that stores only the KV pairs evicted from the window. Evicted pairs are compressed into a per-layer matrix A, and later queries read from it with a single matrix multiplication via the linear-attention identity. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. The key contribution is reserving the outer-product memory exclusively for evicted tokens: attention over recent tokens stays exact, while older context remains accessible without memory growing with sequence length. The paper also identifies that the common chunked-mean training shortcut silently introduces C^2 - C spurious terms, which Tensor Cache avoids. On language modeling tasks, the authors report better perplexity and retrieval accuracy than sliding-window baselines at minimal overhead.
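The mechanism summarized above can be illustrated with a minimal single-head NumPy sketch. This is not the paper's implementation: the class name `TwoLevelCache`, the helpers `exact_write` and `pooled_write`, the fixed (rather than learned) gate, decay, and write rate, and the update rule `A = decay * A + write_rate * outer(k, v)` are all assumptions made for illustration. The two helper functions show one plausible reading of the chunked-mean issue, taking C to be the chunk size: pooling keys and values before the outer product yields all C * C pairings, of which only C are correctly matched.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


class TwoLevelCache:
    """Hypothetical single-head sketch: L1 sliding window plus L2 outer-product memory."""

    def __init__(self, d_k, d_v, window, decay=1.0, write_rate=1.0, gate=0.5):
        self.window = window
        self.decay = decay            # assumption: fixed here, learned per head in the paper
        self.write_rate = write_rate  # assumption: fixed here, learned per head in the paper
        self.gate = gate              # assumption: fixed here, a learned scalar in the paper
        self.keys, self.values = [], []
        self.A = np.zeros((d_k, d_v))  # fixed-size L2 memory

    def step(self, q, k, v):
        # L1: append the new token; evict the oldest once the window overflows.
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.window:
            k_old, v_old = self.keys.pop(0), self.values.pop(0)
            # L2: only the evicted pair is written, as a decayed outer-product update.
            self.A = self.decay * self.A + self.write_rate * np.outer(k_old, v_old)
        K, V = np.stack(self.keys), np.stack(self.values)
        l1 = softmax(K @ q / np.sqrt(q.shape[0])) @ V  # exact attention over the window
        l2 = q @ self.A                                # one matmul reads all evicted pairs
        return self.gate * l1 + (1.0 - self.gate) * l2


def exact_write(K, V):
    """Sum of per-token outer products: C correctly paired terms."""
    return K.T @ V


def pooled_write(K, V):
    """Outer product of pooled keys and values: all C * C pairings."""
    return np.outer(K.sum(axis=0), V.sum(axis=0))


rng = np.random.default_rng(0)
d = 4
keys = np.eye(d)                   # orthonormal keys make recall exact
values = rng.normal(size=(d, d))
cache = TwoLevelCache(d_k=d, d_v=d, window=2)
for k, v in zip(keys, values):
    out = cache.step(q=k, k=k, v=v)

# Tokens 0 and 1 have left the window; querying with key 0 recovers value 0 from A.
recalled = keys[0] @ cache.A

# With C = 2, pooling adds C^2 - C = 2 mismatched cross terms.
K2, V2 = keys[:2], values[:2]
cross = pooled_write(K2, V2) - exact_write(K2, V2)
```

With orthonormal keys and no decay, `recalled` matches `values[0]`, which is the linear-attention identity in action: summing `(q · k_i) v_i` over evicted tokens equals `q @ A`. With non-orthogonal keys the read returns a blend of stored values, which is presumably where the decay and write-rate parameters matter.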

// why it matters

Enables efficient long-context inference without full KV cache growth.
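A rough back-of-the-envelope comparison makes the memory claim concrete. The sketch below counts stored floats only, and every number in it is hypothetical: the model dimensions are not from the paper, and it assumes one `d_head x d_head` memory matrix per head per layer, which the summary does not specify.

```python
def full_kv_floats(tokens, layers, heads, d_head):
    """Floats held by a full KV cache: keys and values for every token."""
    return 2 * tokens * layers * heads * d_head


def two_level_floats(window, layers, heads, d_head):
    """Floats held by a sliding window plus a fixed-size outer-product memory."""
    l1 = 2 * window * layers * heads * d_head   # keys and values inside the window
    l2 = layers * heads * d_head * d_head       # assumption: one matrix per head per layer
    return l1 + l2


# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, heads=32, d_head=128)

for tokens in (8_192, 131_072, 1_048_576):
    full = full_kv_floats(tokens, **shape)
    fixed = two_level_floats(window=4_096, **shape)
    print(f"{tokens:>9} tokens: full KV = {full:,} floats, two-level = {fixed:,} floats")
```

The full cache grows linearly with the number of tokens, while the two-level count depends only on the window size and head dimension, so it stays constant as the context lengthens.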

Sources

Primary · arXiv cs.AI

▸ Read original at arxiv.org
