SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

arXiv, March 30, 2026
arXiv:2512.14080v2 (Announce Type: replace-cross)

Authors: Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao

Abstract: Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (constant number of activated experts with a higher number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computations due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation, benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day, comparable to ScatterMoE's 225 billion tokens per day on 96 H100s for a 7B MoE model training with FSDP-2 using the lm-engine codebase. On Blackwell GPUs, SonicMoE also achieves a 25% and 15% relative speedup on the forward and backward pass respectively compared to a highly optimized DeepGEMM baseline on OLMoE-sized 7B MoE models. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup on kernel execution time compared to vanilla top-K routing while maintaining similar downstream performance on Hopper GPUs. We open-source all our kernels.
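The padding problem the abstract mentions can be illustrated with a small sketch. It assumes a Grouped GEMM that processes each expert's tokens in row tiles of a fixed size, so each expert's token count is padded up to the next tile multiple. The function names, the tile size of 128, and the "round to the nearest tile multiple" rule are assumptions made for illustration; the abstract does not specify the actual token rounding algorithm, which may differ.

```python
import math


def padded_rows(token_counts, tile_m):
    """Rows processed per expert when each count is padded up to a multiple of tile_m."""
    return [math.ceil(c / tile_m) * tile_m for c in token_counts]


def padding_waste(token_counts, tile_m):
    """Fraction of processed rows that are padding rather than real tokens."""
    padded = sum(padded_rows(token_counts, tile_m))
    real = sum(token_counts)
    return (padded - real) / padded if padded else 0.0


def round_to_tile(token_counts, tile_m):
    """Hypothetical rounding rule: snap each expert's token count to the
    nearest multiple of tile_m (ties round up). This changes which tokens
    an expert sees, so it is a routing decision, not just a kernel detail."""
    return [((c + tile_m // 2) // tile_m) * tile_m for c in token_counts]


if __name__ == "__main__":
    tile_m = 128                      # assumed tile size along the token dimension
    counts = [130, 60, 200, 10]       # made-up per-expert token counts
    print("padded rows:", padded_rows(counts, tile_m))
    print("waste before rounding:", round(padding_waste(counts, tile_m), 3))
    rounded = round_to_tile(counts, tile_m)
    print("rounded counts:", rounded)
    print("waste after rounding:", padding_waste(rounded, tile_m))
```

With many experts and few tokens per expert (high sparsity), most experts end with a partially filled tile, which is why the padded fraction grows in that regime. Once every count is a tile multiple, the padded fraction in this toy model drops to zero, at the cost of some experts receiving slightly more or fewer tokens than plain top-K routing would assign.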

Comments: Include the new Blackwell benchmark results

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2512.14080 [cs.LG] (or arXiv:2512.14080v2 [cs.LG] for this version)

DOI: https://doi.org/10.48550/arXiv.2512.14080 (arXiv-issued DOI via DataCite)

Submission history

From: Wentao Guo
[v1] Tue, 16 Dec 2025 04:39:10 UTC (2,977 KB)
[v2] Thu, 26 Mar 2026 20:21:59 UTC (3,969 KB)
