MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-Language Models
arXiv:2603.24984v2 Announce Type: replace
Authors: Dohwan Ko, Jinyoung Park, Seoung Choi, Sanghyeok Lee, Seohyun Lee, Hyunwoo J. Kim
Abstract: Mixture-of-Experts (MoE) has emerged as an effective approach to reduce the computational overhead of Transformer architectures by sparsely activating a subset of parameters for each token while preserving high model capacity. This paradigm has recently been extended to Vision-Language Models (VLMs), enabling scalable multi-modal understanding with reduced computational cost. However, the widely adopted deterministic top-K routing mechanism may overlook better expert combinations and lead to expert overfitting. To address this limitation and improve the diversity of expert selection, we propose MoE-GRPO, a reinforcement learning (RL)-based framework for optimizing expert routing in MoE-based VLMs. Specifically, we formulate expert selection as a sequential decision-making problem and optimize it using Group Relative Policy Optimization (GRPO), allowing the model to learn adaptive expert routing policies through exploration and reward-based feedback. Furthermore, we introduce modality-aware router guidance that enhances training stability and efficiency by discouraging the router from exploring experts that are infrequently activated for a given modality. Extensive experiments on multi-modal image and video benchmarks show that MoE-GRPO consistently outperforms standard top-K routing and its variants by promoting more diverse expert selection, thereby mitigating expert overfitting and enabling task-level expert specialization.
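The routing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how experts are sampled, how the modality-aware guidance is applied, or what reward is used. The sketch assumes experts are drawn without replacement from the router softmax, models the guidance as a simple set of permitted experts (the hypothetical `allowed` argument), and standardizes rewards within a group of rollouts, which is the usual GRPO advantage. All function names and the toy reward are illustrative.

```python
import math
import random
from statistics import mean, pstdev


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits, k):
    """Deterministic baseline: indices of the k largest router logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def sample_route(logits, k, allowed=None, rng=random):
    """Stochastic routing: draw k distinct experts from the router softmax.

    `allowed` is a hypothetical stand-in for modality-aware guidance: a set
    of expert indices the router may explore for the current modality.
    """
    candidates = list(range(len(logits))) if allowed is None else sorted(allowed)
    if k > len(candidates):
        raise ValueError("k exceeds the number of selectable experts")
    weights = softmax([logits[i] for i in candidates])
    chosen = []
    for _ in range(k):
        # Sample one position, then remove it so experts are not repeated.
        pos = rng.choices(range(len(candidates)), weights=weights)[0]
        chosen.append(candidates.pop(pos))
        weights.pop(pos)
    return chosen


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group.

    Population standard deviation is an assumption; some implementations
    use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.1, 2.0, -1.0, 1.5]  # router logits for 4 experts, one token
    print("top-K route:", top_k_route(logits, k=2))

    # A group of 4 rollouts for the same input, each with its own sampled route.
    routes = [sample_route(logits, k=2, allowed={0, 1, 3}, rng=rng) for _ in range(4)]
    # Toy reward (placeholder for a task reward): 1.0 if expert 3 was used.
    rewards = [1.0 if 3 in route else 0.0 for route in routes]
    print("sampled routes:", routes)
    print("advantages:", group_relative_advantages(rewards))
```

In a training loop, each advantage would weight the log-probability of the corresponding sampled route in a policy-gradient update, so routes that score above the group average become more likely. That update step is omitted here because the abstract gives no details on it.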
Comments: Accepted at CVPR 2026
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.24984 [cs.CV]
(or arXiv:2603.24984v2 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.24984
arXiv-issued DOI via DataCite
Submission history
From: Dohwan Ko
[v1] Thu, 26 Mar 2026 03:23:45 UTC (3,700 KB)
[v2] Sun, 29 Mar 2026 06:59:34 UTC (3,695 KB)