All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models
Published on Apr 1
Authors:
Abstract
AI-generated summary: Reinforcement Learning enhances Vision-Language Model reasoning but suffers from diversity collapse; a new Multi-Group Policy Optimization method is proposed to encourage diverse thinking patterns.
Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite this promise, the underlying mechanisms that drive the effectiveness of RL-trained models, as well as their limitations, remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL-trained and base models: the former engage in deeper yet narrower reasoning, while base models, though less refined along any individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse, causing models to prematurely converge to a limited subset of reasoning strategies while discarding the majority of potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: https://xytian1008.github.io/MUPO/
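The contrast between GRPO's single shared baseline and a multi-group one is easiest to see in code. Below is a minimal NumPy sketch: the first function follows the standard GRPO group-relative advantage; the second is a hypothetical illustration of normalizing within clusters of solution strategies (the `group_ids` labeling is assumed for illustration), not the paper's actual MUPO algorithm — see the project page for the real method.

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO: normalize each rollout's reward against the
    whole group sampled for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def multi_group_advantages(rewards, group_ids):
    """Hypothetical multi-group variant: rollouts are first clustered
    by solution strategy (group_ids), and each rollout is scored
    against its own cluster's baseline, so a rarer strategy is not
    penalized merely for competing with the dominant one. This is an
    illustration of the idea, not the paper's MUPO method."""
    r = np.asarray(rewards, dtype=float)
    g = np.asarray(group_ids)
    adv = np.empty_like(r)
    for gid in np.unique(g):
        mask = g == gid
        adv[mask] = (r[mask] - r[mask].mean()) / (r[mask].std() + 1e-8)
    return adv

# Example: two solution strategies for one prompt; strategy 1 is rarer.
rewards   = [1.0, 0.9, 0.8, 0.2, 0.3]   # strategy 0 dominates
group_ids = [0,   0,   0,   1,   1]     # strategy 1 explored less
print(grpo_advantages(rewards))        # strategy-1 rollouts all negative
print(multi_group_advantages(rewards, group_ids))  # each strategy keeps a gradient
```

The point of the contrast: under a single shared baseline, every rollout from the weaker strategy receives a negative advantage and is uniformly suppressed, which is one plausible mechanism for the diversity collapse the abstract describes.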
Get this paper in your agent:
hf papers read 2604.00479
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash