Self-Routing: Parameter-Free Expert Routing from Hidden States
Abstract: Mixture-of-Experts (MoE) layers increase model capacity by activating only a small subset of experts per token, and typically rely on a learned router to map hidden states to expert assignments. In this work, we ask whether a dedicated learned router is strictly necessary in the MoE settings we study. We propose Self-Routing, a parameter-free routing mechanism that uses a designated subspace of the token hidden state directly as expert logits, eliminating the router projection entirely while leaving the rest of the MoE layer unchanged. We evaluate Self-Routing on GPT-2-scale language modeling and ImageNet-1K classification by comparing it against a standard learned router, random-routing baselines, and dense non-MoE baselines. Our results show that Self-Routing remains competitive with the learned-router baseline while removing all dedicated routing parameters, and yields more balanced expert utilization, with about 17% higher average normalized routing entropy and no explicit load-balancing loss. On ImageNet-1K with DeiT-S/16, Self-Routing also slightly improves over the corresponding learned-router MoE. These findings suggest that effective MoE routing can emerge from the hidden representation itself without requiring a separate learned router module.
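The abstract itself contains no code, but the mechanism is concrete enough to sketch. Below is a minimal PyTorch illustration of the idea as described: a designated subspace of the token hidden state is read off directly as expert logits, so there is no learned router projection. The choice of subspace (the leading num_experts coordinates here), the class name, the expert FFN architecture, the top-k value, and the normalized-entropy definition are all illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfRoutingMoE(nn.Module):
    """MoE layer whose expert logits are read directly from the hidden state."""

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        assert num_experts <= d_model, "routing subspace must fit in d_model"
        self.num_experts = num_experts
        self.top_k = top_k
        # Standard expert FFNs; only the router projection is removed.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). The designated subspace (assumed here to
        # be the leading num_experts coordinates) doubles as expert logits.
        logits = x[..., : self.num_experts]                # (B, S, E)
        gate_vals, gate_idx = logits.topk(self.top_k, -1)  # pick top-k experts
        gates = F.softmax(gate_vals, dim=-1)               # renormalize gates

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.num_experts):
                sel = gate_idx[..., k] == e                # tokens routed to e
                if sel.any():
                    out[sel] += gates[..., k][sel].unsqueeze(-1) * self.experts[e](x[sel])
        return out

    @staticmethod
    def normalized_routing_entropy(logits: torch.Tensor) -> torch.Tensor:
        # Entropy of the batch-averaged routing distribution divided by
        # log(num_experts), so 1.0 means perfectly balanced expert usage.
        # One plausible reading of the metric named in the abstract, not
        # necessarily the paper's exact definition.
        p = F.softmax(logits, dim=-1).reshape(-1, logits.size(-1)).mean(0)
        ent = -(p * p.clamp_min(1e-9).log()).sum()
        return ent / torch.log(torch.tensor(float(logits.size(-1))))


# Usage sketch: routing costs zero extra parameters.
layer = SelfRoutingMoE(d_model=768, num_experts=8, top_k=2)
x = torch.randn(2, 16, 768)
y = layer(x)  # (2, 16, 768)
balance = SelfRoutingMoE.normalized_routing_entropy(x[..., :8])
```

Note the design consequence the abstract points at: because the logits are part of the hidden state that the rest of the network already shapes, balanced utilization can emerge without an explicit load-balancing loss term.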
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2604.00421 [cs.AI]
(or arXiv:2604.00421v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.00421
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Jama Hussein Mohamud
[v1] Wed, 1 Apr 2026 03:05:20 UTC (72 KB)