Self-Routing: Parameter-Free Expert Routing from Hidden States
Abstract: Mixture-of-Experts (MoE) layers increase model capacity by activating only a small subset of experts per token, and typically rely on a learned router to map hidden states to expert assignments. In this work, we ask whether a dedicated learned router is strictly necessary in the MoE settings we study. We propose Self-Routing, a parameter-free routing mechanism that uses a designated subspace of the token hidden state directly as expert logits, eliminating the router projection entirely while leaving the rest of the MoE layer unchanged. We evaluate Self-Routing on GPT-2-scale language modeling and ImageNet-1K classification by comparing it against a standard learned router, random-routing baselines, and dense non-MoE baselines. Our results show that Self-Routing remains competitive with the learned-router baseline while removing all dedicated routing parameters, and yields more balanced expert utilization, with about 17% higher average normalized routing entropy and no explicit load-balancing loss. On ImageNet-1K with DeiT-S/16, Self-Routing also slightly improves over the corresponding learned-router MoE. These findings suggest that effective MoE routing can emerge from the hidden representation itself without requiring a separate learned router module.
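The abstract itself contains no code, but the mechanism is concrete enough to sketch. Below is a minimal PyTorch illustration of the idea as described: a designated subspace of the token hidden state is read off directly as expert logits, so there is no learned router projection. The choice of subspace (the leading num_experts coordinates here), the class name, the expert FFN architecture, the top-k value, and the normalized-entropy definition are all illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfRoutingMoE(nn.Module):
    """MoE layer whose expert logits are read directly from the hidden state."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        assert num_experts <= d_model, "routing subspace must fit in d_model"
        self.num_experts = num_experts
        self.top_k = top_k
        # Standard expert FFNs; only the router projection is removed.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). The designated subspace (assumed here to
        # be the leading num_experts coordinates) doubles as expert logits.
        logits = x[..., : self.num_experts]                # (B, S, E)
        gate_vals, gate_idx = logits.topk(self.top_k, -1)  # pick top-k experts
        gates = F.softmax(gate_vals, dim=-1)               # renormalize gates

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.num_experts):
                sel = gate_idx[..., k] == e                # tokens routed to e
                if sel.any():
                    out[sel] += gates[..., k][sel].unsqueeze(-1) * self.experts[e](x[sel])
        return out

    @staticmethod
    def normalized_routing_entropy(logits: torch.Tensor) -> torch.Tensor:
        # Entropy of the batch-averaged routing distribution divided by
        # log(num_experts), so 1.0 means perfectly balanced expert usage.
        # One plausible reading of the metric named in the abstract, not
        # necessarily the paper's exact definition.
        p = F.softmax(logits, dim=-1).reshape(-1, logits.size(-1)).mean(0)
        ent = -(p * p.clamp_min(1e-9).log()).sum()
        return ent / torch.log(torch.tensor(float(logits.size(-1))))


# Usage sketch: routing costs zero extra parameters.
layer = SelfRoutingMoE(d_model=768, num_experts=8, top_k=2)
x = torch.randn(2, 16, 768)
y = layer(x)  # (2, 16, 768)
balance = SelfRoutingMoE.normalized_routing_entropy(x[..., :8])
```

Note the design consequence the abstract points at: because the logits are part of the hidden state that the rest of the network already shapes, balanced utilization can emerge without an explicit load-balancing loss term.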
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2604.00421 [cs.AI]
(or arXiv:2604.00421v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.00421
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Jama Hussein Mohamud
[v1] Wed, 1 Apr 2026 03:05:20 UTC (72 KB)