Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
Authors: Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang
Abstract: Standard Transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns) -- far fewer than value transfer needs. We introduce factored keys, which exploit this asymmetry to physically shrink the KV cache of any pretrained model without retraining from scratch -- unlike Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MLA), which must be designed into the architecture before pretraining. We factorize each key projection $W_K \approx A_{d \times r} B_{r \times d}$ via truncated singular value decomposition (SVD), where $r$ is the chosen compression dimension. We then set $W_K' = A$ as the new key projection, producing compact $r$-dimensional keys for the cache, and absorb $B^\top$ into the query projection ($W_Q' = W_Q B^\top$) at zero cost, since queries are never cached. At the 7B scale, training from scratch with $r = d/4$ (where $d$ is the model dimension) matches full-attention perplexity ($9.24$ vs $9.25$ PPL after 20B tokens, mean over two seeds) while using 12% fewer parameters and training 8% faster. For existing models, SVD followed by QK fine-tuning (3 epochs, less than 1% of pretraining data) achieves 75% key cache savings at roughly 2% quality cost on both GPT-2 and Mistral-7B. The approach composes with GQA and quantization for up to $16\times$ combined key cache compression. For a 7B model serving a 128K context, factored keys save 25 GB of KV cache per user, enabling roughly 60% more concurrent users on identical hardware.
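The factorization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head, a row-vector convention ($q = x W_Q$, $k = x W_K$), random weights, and it ignores the softmax scaling and positional encodings. The function name `factor_keys` and all sizes are hypothetical, chosen only for illustration.

```python
import numpy as np

def factor_keys(W_q, W_k, r):
    """Rank-r factorization of W_k via truncated SVD, with B^T absorbed into W_q.

    Assumed convention: q = x @ W_q and k = x @ W_k, both weights of shape (d, d).
    Returns the new query projection (d, r) and the new key projection (d, r).
    """
    U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
    A = U[:, :r] * S[:r]   # (d, r): plays the role of W_K' = A
    B = Vt[:r, :]          # (r, d): so that W_k is approximately A @ B
    return W_q @ B.T, A    # W_Q' = W_Q B^T, W_K' = A

# Illustrative sizes only (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
d, r, n = 64, 16, 10       # r = d / 4
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
x = rng.standard_normal((n, d))

W_q_new, W_k_new = factor_keys(W_q, W_k, r)

# Only these r-dimensional keys would need to be cached.
keys = x @ W_k_new
print(keys.shape)          # (10, 16)

# Attention logits computed from the compact queries and keys.
scores_factored = (x @ W_q_new) @ keys.T

# Reference: original queries against keys from the rank-r approximation A @ B.
U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
W_k_lowrank = (U[:, :r] * S[:r]) @ Vt[:r, :]
scores_lowrank = (x @ W_q) @ (x @ W_k_lowrank).T

# Absorbing B^T into the query projection leaves the logits unchanged.
print(np.allclose(scores_factored, scores_lowrank))
```

With $r = d/4$, each cached key holds a quarter of the original entries, which corresponds to the 75% key cache savings quoted in the abstract. The logits agree exactly with those of the rank-$r$ approximation of $W_K$; any gap to the original full-rank logits comes from the truncation itself, which is the error the abstract says QK fine-tuning is meant to recover.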
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.04427 [cs.LG]
(or arXiv:2603.04427v4 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2603.04427
arXiv-issued DOI via DataCite
Submission history
From: Yao Hengshuai
[v1] Mon, 16 Feb 2026 23:45:39 UTC (16 KB)
[v2] Tue, 17 Mar 2026 17:54:04 UTC (39 KB)
[v3] Wed, 18 Mar 2026 15:05:32 UTC (39 KB)
[v4] Sat, 28 Mar 2026 06:06:40 UTC (40 KB)