Models model announce available restrict arxiv github

MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation

arXiv cs.LGby [Submitted on 31 Mar 2026]April 2, 20262 min read1 views

arXiv:2604.00235v1 Announce Type: new Abstract: Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail th

View PDF HTML (experimental)

Abstract:Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups, up to 2.6x end-to-end, while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available here: this https URL

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Cite as: arXiv:2604.00235 [cs.LG]

(or arXiv:2604.00235v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2604.00235

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jinghan Yao [view email] [v1] Tue, 31 Mar 2026 20:57:23 UTC (5,903 KB)

Original source

arXiv cs.LG

https://arxiv.org/abs/2604.00235

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

modelannounceavailable

ModelsLive

Google strongly implies the existence of large Gemma 4 models

In the huggingface card: Increased Context Window – The small models feature a 128K context window, while the medium models support 256K. Small and medium... implying at least one large model! 124B confirmed :P submitted by /u/coder543 [link] [comments]