
Self-Routing: Parameter-Free Expert Routing from Hidden States

ArXiv CS.AI · Jama Hussein Mohamud, Drew Wagner, Mirco Ravanelli · April 2, 2026


Abstract: Mixture-of-Experts (MoE) layers increase model capacity by activating only a small subset of experts per token, and typically rely on a learned router to map hidden states to expert assignments. In this work, we ask whether a dedicated learned router is strictly necessary in the MoE settings we study. We propose Self-Routing, a parameter-free routing mechanism that uses a designated subspace of the token hidden state directly as expert logits, eliminating the router projection entirely while leaving the rest of the MoE layer unchanged. We evaluate Self-Routing on GPT-2-scale language modeling and ImageNet-1K classification by comparing it against a standard learned router, random-routing baselines, and dense non-MoE baselines. Our results show that Self-Routing remains competitive with the learned-router baseline while removing all dedicated routing parameters, and yields more balanced expert utilization, with about 17% higher average normalized routing entropy and no explicit load-balancing loss. On ImageNet-1K with DeiT-S/16, Self-Routing also slightly improves over the corresponding learned-router MoE. These findings suggest that effective MoE routing can emerge from the hidden representation itself without requiring a separate learned router module.
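The abstract does not specify implementation details, so the following is a minimal sketch under the simplest reading of the idea: the leading `num_experts` coordinates of the token hidden state are read off directly as expert logits, replacing the usual learned router projection while the experts themselves are unchanged. The class name `SelfRoutingMoE`, the choice of the leading coordinates as the designated subspace, and hyperparameters like `top_k` and `d_ff` are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of a Self-Routing MoE layer (not the authors' code).
# Assumption: the designated subspace is the first num_experts dimensions
# of the hidden state, used directly as expert logits with no router weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfRoutingMoE(nn.Module):
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2, d_ff: int = 2048):
        super().__init__()
        assert num_experts <= d_model, "routing subspace must fit in the hidden state"
        self.num_experts = num_experts
        self.top_k = top_k
        # Standard feed-forward experts; only the routing differs from a learned-router MoE.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Read expert logits straight out of the hidden
        # state -- this is the parameter-free routing step.
        logits = x[..., : self.num_experts]             # no router projection
        weights, idx = logits.topk(self.top_k, dim=-1)  # top-k expert selection
        weights = F.softmax(weights, dim=-1)            # normalize over chosen experts

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.num_experts):
                mask = idx[..., k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

Since the routing signal lives inside the representation the model is already optimizing, any load balancing would have to emerge from training rather than from an auxiliary loss, which is consistent with the abstract's report of higher routing entropy without an explicit load-balancing term.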

Subjects:

Artificial Intelligence (cs.AI)

Cite as: arXiv:2604.00421 [cs.AI]

(or arXiv:2604.00421v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.00421

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jama Hussein Mohamud [v1] Wed, 1 Apr 2026 03:05:20 UTC (72 KB)
