SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations
arXiv:2512.14080v2 Announce Type: replace-cross Abstract: Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (constant number of activated experts with a higher number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO co — Wentao Guo, Mayank Mishra, Xinle Cheng, Ion Stoica, Tri Dao
View PDF HTML (experimental)
Abstract:Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (constant number of activated experts with a higher number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computations due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation, benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day, comparable to ScatterMoE's 225 billion tokens per day on 96 H100s for a 7B MoE model training with FSDP-2 using the lm-engine codebase. On Blackwell GPUs, SonicMoE also achieves a 25% and 15% relative speedup on the forward and backward pass respectively compared to a highly optimized DeepGEMM baseline on OLMoE-sized 7B MoE models. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup on kernel execution time compared to vanilla top-K routing while maintaining similar downstream performance on Hopper GPUs. We open-source all our kernels.
Comments: Include the new Blackwell benchmark results
Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2512.14080 [cs.LG]
(or arXiv:2512.14080v2 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2512.14080
arXiv-issued DOI via DataCite
Submission history
From: Wentao Guo [view email] [v1] Tue, 16 Dec 2025 04:39:10 UTC (2,977 KB) [v2] Thu, 26 Mar 2026 20:21:59 UTC (3,969 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers
Artificial intelligence assisted colorectal lesion detection in private practices a randomized controlled study
npj Digital Medicine, Published online: 01 April 2026; doi:10.1038/s41746-026-02576-8 Artificial intelligence assisted colorectal lesion detection in private practices a randomized controlled study

Robust Multidimensional Chinese Remainder Theorem (MD-CRT) with Non-Diagonal Moduli and Multi-Stage Framework
arXiv:2604.00995v1 Announce Type: new Abstract: The Chinese remainder theorem (CRT) provides an efficient way to reconstruct an integer from its remainders modulo several integer moduli, and has been widely applied in signal processing and information theory. Its multidimensional extension (MD-CRT) generalizes this principle to integer vectors and integer matrix moduli, enabling reconstruction in multidimensional signal processing scenarios. However, since matrices are generally non-commutative, the multidimensional extension introduces new theoretical and algorithmic challenges. When all matrix moduli are diagonal, the system is equivalent to applying the one-dimensional CRT independently along each dimension. This work first investigates whether non-diagonal (non-separable) moduli offer

Spatial Upper Bound of Radiated Power in Active Antenna Systems
arXiv:2604.00846v1 Announce Type: new Abstract: The assessment of unwanted radiated emissions from Active Antenna Systems (AAS) has become a critical issue in adjacent-band coexistence scenarios. In this paper, we establish the existence of a deterministic spatial upper bound on the radiated power of active antenna arrays. We show that the maximum radiated power always occurs in the boresight direction, irrespective of frequency or signal nature (useful signal, nonlinear distortion, or noise), or instantaneous beamforming configuration, thereby defining a conservative spatial upper bound whose angular envelope is solely determined by the elementary radiating building block of the antenna architecture, i.e., the element or sub-array radiation pattern. Starting from a two-element array with

Learning Laplacian Forms for Graph Signal Processing via the Deformed Laplacian
arXiv:2604.00728v1 Announce Type: new Abstract: Learning the graph Laplacian from observed data is one of the most investigated and fundamental tasks in Graph Signal Processing (GSP). Different variants of the Laplacian, such as the combinatorial, signless or signed Laplacians have been considered depending on the type of features to be extracted from the data. The main contribution of this paper is the introduction of a parametric Laplacian, called the deformed Laplacian, defined as a quadratic matrix polynomial that provides a parametric dictionary for graph signal processing. The deformed Laplacian can be interpreted as the generator of a parametric linear reaction-diffusion dynamics on graphs, capturing the interplay between diffusive coupling and nodal reaction effects. It is a parame
Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!