
Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

arXiv · March 31, 2026

arXiv:2603.04427v4. Announce type: replace-cross. Authors: Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang


Abstract: Standard Transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns) -- far fewer than value transfer needs. We introduce factored keys, which exploit this asymmetry to physically shrink the KV cache of any pretrained model without retraining from scratch -- unlike Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MLA), which must be designed into the architecture before pretraining. We factorize each key projection $W_K \approx A_{d \times r} B_{r \times d}$ via truncated singular value decomposition (SVD) (where $r$ is the chosen compression dimension), set $W_K' = A$ as the new key projection producing compact $r$-dimensional keys for the cache, and absorb $B^\top$ into the query projection ($W_Q' = W_Q B^\top$) at zero cost -- since queries are never cached. At the 7B scale, training from scratch with $r = d/4$ (where $d$ is the model dimension) matches full-attention perplexity ($9.24$ vs $9.25$ PPL after 20B tokens, mean over two seeds) while using 12% fewer parameters and training 8% faster. For existing models, SVD followed by QK fine-tuning (3 epochs, less than 1% of pretraining data) achieves 75% key cache savings at roughly 2% quality cost on both GPT-2 and Mistral-7B. The approach composes with GQA and quantization for up to $16\times$ combined key cache compression. For a 7B model serving a 128K context, factored keys save 25 GB of KV cache per user, enabling roughly 60% more concurrent users on identical hardware.
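The factorization described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a single attention head, a row-vector convention (queries are `x @ W_Q`, keys are `x @ W_K`), and no fine-tuning step. The names `factor_keys` and `key_cache_gib` are hypothetical, as are the model shapes used in the cache-size arithmetic (32 layers, key width 4096, 2-byte elements, no GQA).

```python
import numpy as np


def factor_keys(W_Q, W_K, r):
    """Return (W_Q_new, W_K_new) so that thin r-dimensional keys
    approximately reproduce the original attention logits.

    Assumes the row-vector convention q = x @ W_Q, k = x @ W_K.
    """
    # Truncated SVD: W_K ~= A @ B with A of shape (d, r), B of shape (r, d).
    U, s, Vt = np.linalg.svd(W_K, full_matrices=False)
    A = U[:, :r] * s[:r]   # fold the top-r singular values into A
    B = Vt[:r, :]
    W_K_new = A            # x @ A gives the r-dimensional keys that get cached
    W_Q_new = W_Q @ B.T    # absorb B^T into the query projection
    return W_Q_new, W_K_new


def key_cache_gib(n_layers, n_tokens, key_dim, bytes_per_elem=2):
    """Key-cache size in GiB for one sequence (illustrative shapes only)."""
    return n_layers * n_tokens * key_dim * bytes_per_elem / 2**30


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, r, n = 64, 16, 8
    W_Q = rng.standard_normal((d, d))
    W_K = rng.standard_normal((d, d))
    X = rng.standard_normal((n, d))

    W_Q_new, W_K_new = factor_keys(W_Q, W_K, r)
    full_logits = (X @ W_Q) @ (X @ W_K).T
    thin_logits = (X @ W_Q_new) @ (X @ W_K_new).T
    # A random W_K is full rank, so truncation introduces a visible error here.
    print("max |logit difference|:", np.abs(full_logits - thin_logits).max())

    # Assumed shapes: 32 layers, 128K tokens, fp16, key width 4096 vs 1024.
    print("full keys (GiB):", key_cache_gib(32, 128 * 1024, 4096))
    print("thin keys (GiB):", key_cache_gib(32, 128 * 1024, 1024))
```

The identity behind the sketch is $(x W_Q)(x W_K)^\top \approx (x W_Q B^\top)(x A)^\top$, so only the $r$-dimensional product `x @ A` needs to be cached. Under the assumed shapes, keys shrink from 32 GiB to 8 GiB at $r = d/4$, a saving of 24 GiB (about 25.8 GB), and the total KV cache drops from 64 GiB to 40 GiB, a ratio of 1.6. That is consistent with the abstract's figures of 25 GB saved and roughly 60% more concurrent users, though the paper's exact configuration may differ.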

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.04427 [cs.LG] (or arXiv:2603.04427v4 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.04427 (arXiv-issued DOI via DataCite)

Submission history

From: Yao Hengshuai
[v1] Mon, 16 Feb 2026 23:45:39 UTC (16 KB)
[v2] Tue, 17 Mar 2026 17:54:04 UTC (39 KB)
[v3] Wed, 18 Mar 2026 15:05:32 UTC (39 KB)
[v4] Sat, 28 Mar 2026 06:06:40 UTC (40 KB)

Original source: arXiv, https://arxiv.org/abs/2603.04427
