MAC-Attention: a Match-Amend-Complete Scheme for Fast and Accurate Attention Computation
arXiv:2604.00235v1 Announce Type: new Abstract: Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail th
View PDF HTML (experimental)
Abstract:Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression, which lowers fidelity, or selection/eviction, which restricts what remains accessible, and both can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail through a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), compared to the latest FlashInfer library, MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups, up to 2.6x end-to-end, while maintaining full-attention quality. By reusing computation, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available here: this https URL
Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as: arXiv:2604.00235 [cs.LG]
(or arXiv:2604.00235v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2604.00235
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Jinghan Yao [view email] [v1] Tue, 31 Mar 2026 20:57:23 UTC (5,903 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
modelannounceavailable
Google strongly implies the existence of large Gemma 4 models
In the huggingface card: Increased Context Window – The small models feature a 128K context window, while the medium models support 256K. Small and medium... implying at least one large model! 124B confirmed :P submitted by /u/coder543 [link] [comments]
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Models

Google strongly implies the existence of large Gemma 4 models
In the huggingface card: Increased Context Window – The small models feature a 128K context window, while the medium models support 256K. Small and medium... implying at least one large model! 124B confirmed :P submitted by /u/coder543 [link] [comments]





Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!