
CoDec: Prefix-Shared Decoding Kernel for LLMs

arXiv, March 31, 2026


Authors: Zhibin Wang, Rui Ning, Chao Fang, Zhonghui Zhang, Xi Lin, Shaobo Ma, Mo Zhou, Xue Li, Zhongfeng Wang, Chengying Huan, Rong Gu, Kun Yang, Guihai Chen, Sheng Zhong, Chen Tian


Abstract: Prefix-sharing among multiple prompts presents opportunities to combine the operations of the shared prefix. Meanwhile, attention computation in the decode stage, which becomes a critical bottleneck as context lengths increase, is a memory-intensive process that requires heavy memory access to the key-value (KV) cache of the prefixes. In this paper, we therefore explore the potential of prefix-sharing in the attention computation of the decode stage. However, the tree structure of the prefix-sharing mechanism makes it challenging for attention computation to process shared KV cache access patterns efficiently while managing complex dependencies and balancing irregular workloads. To address these challenges, we propose CoDec, a dedicated attention kernel that combines the memory access of shared prefixes in the decode stage. CoDec delivers two key innovations: a novel shared-prefix attention kernel that optimizes use of the memory hierarchy and exploits both intra-block and inter-block parallelism, and a comprehensive workload balancing mechanism that efficiently estimates cost, divides tasks, and schedules execution. Experimental results show that, for attention computation in the decode stage, CoDec achieves an average $1.9\times$ speedup and $120.9\times$ memory access reduction compared to the state-of-the-art FlashDecoding kernel, and a $3.8\times$ improvement in end-to-end time per output token compared to vLLM.
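The idea of combining memory access on a shared prefix can be illustrated with a minimal pure-Python sketch. This is not the CoDec kernel: it only shows the standard softmax decomposition (merging partial attention results through their log-sum-exp values) that lets attention over a shared prefix be computed separately from attention over each request's own suffix. All function names, and the single-level prefix (rather than the prefix tree the abstract describes), are assumptions made for illustration. The two counting helpers are a rough model of KV-entry reads, assuming the shared prefix is read once per batch instead of once per request.

```python
import math

def partial_attention(q, keys, values):
    """Softmax attention of one query over one KV segment.

    Returns the normalized output vector and the log-sum-exp of the
    scaled scores, which is what a later merge step needs.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)

def merge(o1, lse1, o2, lse2):
    """Combine two partial attention results into attention over both segments."""
    m = max(lse1, lse2)
    w1, w2 = math.exp(lse1 - m), math.exp(lse2 - m)
    return [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(o1, o2)]

def shared_prefix_decode(queries, prefix_kv, suffix_kvs):
    """Decode-step attention for requests that share one prefix (hypothetical helper).

    The prefix part is computed for the whole batch in one place; a real
    kernel would load the prefix KV once and reuse it across queries.
    This sketch only reproduces the arithmetic, not the memory behavior.
    """
    pk, pv = prefix_kv
    prefix_parts = [partial_attention(q, pk, pv) for q in queries]
    outputs = []
    for q, (sk, sv), (po, plse) in zip(queries, suffix_kvs, prefix_parts):
        so, slse = partial_attention(q, sk, sv)
        outputs.append(merge(po, plse, so, slse))
    return outputs

def reference_decode(queries, prefix_kv, suffix_kvs):
    """Baseline: each request attends over its full (prefix + suffix) KV."""
    pk, pv = prefix_kv
    return [partial_attention(q, pk + sk, pv + sv)[0]
            for q, (sk, sv) in zip(queries, suffix_kvs)]

def kv_reads_baseline(batch, prefix_len, suffix_len):
    """KV entries read when every request re-reads the prefix."""
    return batch * (prefix_len + suffix_len)

def kv_reads_shared(batch, prefix_len, suffix_len):
    """KV entries read when the shared prefix is read once per batch."""
    return prefix_len + batch * suffix_len
```

Under this simplified counting model, the saving grows with the batch size and with the ratio of prefix length to suffix length. The measured figures in the abstract come from the authors' experiments and are not derived from this model.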

Subjects: Machine Learning (cs.LG)

Cite as: arXiv:2505.17694 [cs.LG]

(or arXiv:2505.17694v2 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2505.17694

arXiv-issued DOI via DataCite

Submission history

From: Rui Ning
[v1] Fri, 23 May 2025 10:03:28 UTC (652 KB)
[v2] Sat, 28 Mar 2026 10:14:51 UTC (898 KB)
