EIGENVECTOR

$\pi$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling

arXiv · March 31, 2026 · 10 min read

arXiv:2511.10696v2 (announce type: replace-cross)

Authors: Dong Liu, Yanxuan Yu

Abstract: Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention reduce computational costs by restricting attention to local neighborhoods, they suffer from limited receptive fields and lack of adaptability. We present $\pi$-Attention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic $\pi$-stride skips, and an adaptive fusion gate. The periodic structure provides predictable coverage of distant tokens, while the sparse footprint keeps the per-layer complexity linear in context length. We prove that $\pi$-Attention achieves $\mathcal{O}(kL + \pi \log L)$ receptive field growth compared to $\mathcal{O}(kL)$ for RingAttention, where $k$ is the local window size, $\pi$ is the skip period, and $L$ is the sequence length. Extensive experiments on language modeling, retrieval, and vision-language tasks demonstrate that $\pi$-Attention matches or surpasses dense attention quality with 8.3% lower perplexity than RingAttention while using 50% fewer GPUs for the same context length. Our detailed ablations and visualizations reveal the importance of periodic skips, adaptive fusion, and head-level sparsity coordination for efficient long-context modeling.
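The abstract does not spell out the exact sparsity pattern, so the following is only an illustrative sketch of how a ring-local window combined with periodic skips can keep the number of attended positions per token constant, and hence the total footprint linear in sequence length. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the wrap-around (ring) indexing, the bidirectional window, and the `num_skips` parameter that caps how many skip targets each token gets. The adaptive fusion gate is not modeled.

```python
def pi_attention_neighbors(i, seq_len, window, period, num_skips):
    """Return the sorted key positions that query position i attends to.

    Hypothetical pattern: a wrap-around local window plus a fixed number
    of positions at multiples of `period` ahead of i.
    """
    # Ring-local neighborhood: positions within `window` steps of i,
    # wrapping around the ends of the sequence.
    local = {(i + d) % seq_len for d in range(-window, window + 1)}
    # Periodic skips: `num_skips` positions at offsets period, 2*period, ...
    skips = {(i + m * period) % seq_len for m in range(1, num_skips + 1)}
    return sorted(local | skips)


def total_attended_pairs(seq_len, window, period, num_skips):
    """Count all (query, key) pairs in the sparse pattern."""
    return sum(
        len(pi_attention_neighbors(i, seq_len, window, period, num_skips))
        for i in range(seq_len)
    )


if __name__ == "__main__":
    # Token 0 sees its ring neighbors (15, 0, 1) and two skip targets (4, 8).
    print(pi_attention_neighbors(0, seq_len=16, window=1, period=4, num_skips=2))
    # Doubling the sequence length doubles the pair count (80 -> 160),
    # whereas dense attention would quadruple it (256 -> 1024).
    print(total_attended_pairs(16, 1, 4, 2), total_attended_pairs(32, 1, 4, 2))
```

Under these assumptions each token attends to at most `2 * window + 1 + num_skips` positions regardless of sequence length, which is one simple way a periodic sparse pattern can reach distant tokens while staying linear per layer.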

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as: arXiv:2511.10696 [cs.CL]

(or arXiv:2511.10696v2 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2511.10696

arXiv-issued DOI via DataCite

Submission history

From: Dong Liu
[v1] Wed, 12 Nov 2025 09:09:13 UTC (686 KB)
[v2] Sat, 28 Mar 2026 08:58:09 UTC (686 KB)
